About

“Copy/Paste is the mother of learning.”
“Repetition! Repetition is the mother of learning.”

Sources: GitHub | Google Drive | OneDrive

Environment

Assumption: Working directory has sub-folders named "data", "images", "code", "docs".

R Version

# #R Version
R.version.string
## [1] "R version 4.1.2 (2021-11-01)"

Working Directory

# #Working Directory
getwd()
## [1] "D:/Analytics/xADSM"

Session Info

# #Version information about R, the OS and attached or loaded packages
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252    LC_MONETARY=English_India.1252
## [4] LC_NUMERIC=C                   LC_TIME=English_India.1252    
## 
## attached base packages:
## [1] compiler  grid      stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] Hmisc_4.6-0          Formula_1.2-4        survival_3.2-13      rfm_0.2.2           
##  [5] ggrepel_0.9.1        arulesViz_1.5-1      arules_1.7-3         cluster_2.1.2       
##  [9] factoextra_1.0.7     stringi_1.7.6        mlbench_2.1-3        glmnet_4.1-3        
## [13] Matrix_1.4-0         caret_6.0-90         lattice_0.20-45      RColorBrewer_1.1-2  
## [17] fastDummies_1.6.3    psych_2.1.9          scales_1.1.1         viridisLite_0.4.0   
## [21] corrplot_0.92        GGally_2.1.2         microbenchmark_1.4.9 ggpmisc_0.4.5       
## [25] ggpp_0.4.3           qcc_2.7              VIM_6.1.1            colorspace_2.0-2    
## [29] mice_3.14.0          nortest_1.0-4        Lahman_9.0-0         gapminder_0.3.0     
## [33] nycflights13_1.0.2   gifski_1.4.3-1       data.table_1.14.2    zoo_1.8-9           
## [37] car_3.0-12           carData_3.0-5        lubridate_1.8.0      e1071_1.7-9         
## [41] latex2exp_0.9.3      readxl_1.3.1         kableExtra_1.3.4     forcats_0.5.1       
## [45] stringr_1.4.0        dplyr_1.0.7          purrr_0.3.4          readr_2.1.2         
## [49] tidyr_1.2.0          tibble_3.1.6         ggplot2_3.3.5        conflicted_1.1.0    
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.4.1      systemfonts_1.0.3    igraph_1.2.11        plyr_1.8.6          
##   [5] sp_1.4-6             splines_4.1.2        listenv_0.8.0        digest_0.6.29       
##   [9] foreach_1.5.2        htmltools_0.5.2      viridis_0.6.2        fansi_1.0.2         
##  [13] checkmate_2.0.0      magrittr_2.0.2       memoise_2.0.1        tzdb_0.2.0          
##  [17] graphlayouts_0.8.0   recipes_0.1.17       globals_0.14.0       gower_0.2.2         
##  [21] svglite_2.0.0        jpeg_0.1-9           rvest_1.0.2          xfun_0.29           
##  [25] crayon_1.4.2         jsonlite_1.7.3       iterators_1.0.13     glue_1.6.1          
##  [29] polyclip_1.10-0      gtable_0.3.0         ipred_0.9-12         webshot_0.5.2       
##  [33] MatrixModels_0.5-0   future.apply_1.8.1   shape_1.4.6          DEoptimR_1.0-10     
##  [37] abind_1.4-5          SparseM_1.81         DBI_1.1.2            Rcpp_1.0.8          
##  [41] htmlTable_2.4.0      laeken_0.5.2         tmvnsim_1.0-2        foreign_0.8-82      
##  [45] proxy_0.4-26         stats4_4.1.2         lava_1.6.10          prodlim_2019.11.13  
##  [49] vcd_1.4-9            htmlwidgets_1.5.4    httr_1.4.2           ellipsis_0.3.2      
##  [53] farver_2.1.0         pkgconfig_2.0.3      reshape_0.8.8        nnet_7.3-17         
##  [57] sass_0.4.0           utf8_1.2.2           tidyselect_1.1.1     rlang_1.0.1         
##  [61] reshape2_1.4.4       munsell_0.5.0        cellranger_1.1.0     tools_4.1.2         
##  [65] cachem_1.0.6         cli_3.1.1            generics_0.1.2       ranger_0.13.1       
##  [69] broom_0.7.12         evaluate_0.14        fastmap_1.1.0        yaml_2.2.2          
##  [73] ModelMetrics_1.2.2.2 knitr_1.37           tidygraph_1.2.0      robustbase_0.93-9   
##  [77] ggraph_2.0.5         future_1.23.0        nlme_3.1-155         quantreg_5.87       
##  [81] xml2_1.3.3           rstudioapi_0.13      png_0.1-7            tweenr_1.0.2        
##  [85] bslib_0.3.1          vctrs_0.3.8          pillar_1.7.0         lifecycle_1.0.1     
##  [89] lmtest_0.9-39        jquerylib_0.1.4      latticeExtra_0.6-29  R6_2.5.1            
##  [93] bookdown_0.24        gridExtra_2.3        parallelly_1.30.0    codetools_0.2-18    
##  [97] boot_1.3-28          MASS_7.3-55          assertthat_0.2.1     withr_2.4.3         
## [101] mnormt_2.0.2         parallel_4.1.2       hms_1.1.1            rpart_4.1.16        
## [105] timeDate_3043.102    class_7.3-20         rmarkdown_2.11       ggforce_0.3.3       
## [109] pROC_1.18.0          base64enc_0.1-3

Pandoc

# #Pandoc Version being used by RStudio
rmarkdown::pandoc_version()
## [1] '2.14.0.3'

Aside

I wanted a single document containing Notes, Code, & Output as a quick reference for the lectures. A combination of multiple file formats (docx, csv, xlsx, R, png, etc.) was not working out for me, so I used the Bookdown package to generate this HTML file.

All of us had to stumble through some of the most common problems individually; as we approach deeper topics, a more collaborative approach might be more beneficial.

Further, the lectures are highly focused, so I had to explore some side topics in more detail to get the most benefit from them. I have included those topics, and I am also interested in hearing about your experiences.

Towards that goal, I am sharing these notes, hoping that you will run the code in your own environment and raise any queries, problems, or differences in outcomes. Any suggestion or criticism is welcome. I have tried not to make any significant changes to your working environment; please let me know if you observe otherwise.

Currently, my priority is to get in sync with the ongoing lectures. The time constraint has led to issues given below. These will be corrected as and when possible.

  • Tone of the document may be a little abrupt; please overlook that.
  • Source references are not added as much as I wanted (from where I copied and learned), and there is no easy solution for this.
  • I have NOT explained some of the functions before their usage (lapply(), identical(), etc.). Hyperlinks will be added as and when those topics are covered.
  • Code has been checked only on Windows 10. For Mac or Linux, if you find something with different output or behaviour, please let me know.
  • Although these notes are generated using R Markdown and Bookdown, I have not yet covered these topics. If you need any help creating your own notes, please let me know; if I have a solution for your problem, I will share it.

Last, but not least, I am also learning while creating this, so if you think I am wrong somewhere, please point it out. I am always open to suggestions.

Thank You all for the encouragement.

Shivam



1 R Introduction (B09, Aug-31)

1.1 R Basics

R is case-sensitive, i.e. c() not C(), and View() not view().

The hash sign “#” comments out anything after it, until the end of the line. There are no multiline comments in R.

Backslash “\” is reserved to escape the character that follows it.

The Escape key interrupts the parser, e.g. at the “+” prompt where R is waiting for more input before evaluation.
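A minimal sketch of the comment and escape rules (the object name ‘x’ is illustrative):

```r
# #'#' comments out the rest of the line
x <- 10  #this part is ignored by the parser
#
# #'\' escapes the character that follows it
cat("Tab:\tNewline:\n")
#
# #A string written as "\\" contains exactly one backslash character
nchar("\\")
## [1] 1
```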

Overview

1.1.1 R Studio

  • There are 4 Panes -
    1. Top Left - R Editor, Source
    2. Bottom Left - Console, Terminal, …
    3. Top Right - Environment, History, …
    4. Bottom Right - Plots, …
  • Sometimes there are only 3 panes i.e. Editor Pane is missing
    • To Open Editor Pane - Create a New R Script by “File | New File | R Script” or "Ctrl+ Shift+ N"
  • To Modify Pane Settings - Tools | Global Options | Pane Layout

1.1.2 Shortcuts

  • Execute the current expression in Source Pane (Top): “Ctrl+ Enter”
  • Execute the current expression in Console Pane (Bottom): “Enter”
  • Clear the Console Pane (Bottom): “Ctrl+ L”
  • Restart the Current R Session: “Ctrl+ Shift+ F10”
  • Create a New R Script: “Ctrl+ Shift+ N”
  • Insert ” <- ” i.e. Assignment Operator with Space: “Alt+ -”
  • Insert ” %>% ” i.e. Pipe Operator with Space: “Ctrl+ Shift+ M”
  • Comment or Uncomment Lines: “Ctrl+ Shift+ C”
  • Set Working Directory: “Ctrl+ Shift+ H”
  • Search Command History: “Ctrl+ Up Arrow”
  • Search Files: “Ctrl+ .”

1.1.3 Executing an Expression

Execute the current expression in Source Pane (Top) by ‘Run’ Button or "Ctrl+ Enter"

Execute the current expression in Console Pane (Bottom) by “Enter”

1.1.4 PATH and Working Directory

Windows 10 uses the backslash “\” in PATHs. R, however, uses the forward slash “/”. The backslash “\” is the escape character in R.

  • So, to provide “C:\Users\userName\Documents” as a PATH
    • Use: “C:\\Users\\userName\\Documents”
    • OR: “C:/Users/userName/Documents”
    • OR: “~” Tilde acts as a Reference to Home Directory
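Base R also has helpers for building such paths portably; a small sketch (the path components are illustrative):

```r
# #file.path() joins path components with "/" on every OS
p <- file.path("C:", "Users", "userName", "Documents")
p
## [1] "C:/Users/userName/Documents"
#
# #basename() and dirname() split a path into its last component and the rest
basename(p)
## [1] "Documents"
dirname(p)
## [1] "C:/Users/userName"
```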

In R Studio, Set Working Directory by:

  • Session | Set Working Directory | Choose Directory or "Ctrl+ Shift+ H"
# #Current Working Directory
getwd()
## [1] "D:/Analytics/xADSM"
#
# #R Installation Directory (old DOS 8.3 convention, i.e. ~1 after the first 6 letters)
R.home()
## [1] "C:/PROGRA~1/R/R-41~1.2"
Sys.getenv("R_HOME") 
## [1] "C:/PROGRA~1/R/R-41~1.2"
#
# #This is Wrapped in IF Block to prevent accidental execution
if(FALSE){
# #WARNING: This will change your Working Directory
  setwd("~")
}

1.1.5 Printing

If R code is entered at the console, line by line, then the output is printed automatically, i.e. no function is needed for printing. This is called implicit printing.

Inside an R Script File, implicit printing does not work and the expression needs to be printed explicitly.

In R, the most common method to print the output ‘explicitly’ is by the function print().

# #Implicit Printing: This will NOT be printed to Console, if it is inside an R Script.
"Hello World!"
#
# #Implicit Printing using '()': Same as above
("Hello World!")
#
# #Explicit Printing using print() : To print Objects to Console, even inside an R Script.
print("Hello World!")
## [1] "Hello World!"

1.2 Objects

1.2.1 List ALL Objects

Everything that exists in R is an object in the sense that it is a kind of data structure that can be manipulated. Expressions for evaluation are themselves objects; Evaluation consists of taking the object representing an expression and returning the object that is the value of that expression.

# #ls(): List ALL Objects in the Current NameSpace (Environment)
ls()
## character(0)

1.2.2 Assign a Value to an Object

Caution: Always use “<-” for assignment, NOT “=”

While “=” can be used for assignment, this usage is highly discouraged because it may behave differently under certain subtle conditions which are difficult to debug. The convention is to use “=” only in function calls for argument association (as a syntactic token).

There are 5 assignment operators (<-, =, <<-, ->, ->>), others are not going to be discussed for now.

All the created objects are listed in the Environment Tab of the Top Right Pane.

# #Assignment Operator "<-" is used to assign any value (ex: 10) to any object (ex: 'bb')
bb <- 10
#
# #Print Object
print(bb)
## [1] 10
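A small demonstration of the subtle difference, assuming a fresh session where no object ‘x’ exists yet:

```r
# #'=' inside a call only associates the value with argument 'x'; no object is created
median(x = 1:10)
## [1] 5.5
exists("x") #FALSE in a fresh session
## [1] FALSE
#
# #'<-' inside a call also assigns 'x' in the workspace, as a side effect
median(x <- 1:10)
## [1] 5.5
exists("x")
## [1] TRUE
```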

1.2.3 Remove an Object

In the Environment Tab, any object can be selected and deleted using the broom icon.

# #Trying to Print an Object 'bb' and Handling the Error, if thrown
tryCatch(print(bb), error = function(e) print(paste0(e)))
## [1] 10
#
# #Remove an Object
rm(bb)
#
# #Equivalent
if(FALSE) {rm("bb")} #Same
if(FALSE) {rm(list = "bb")} #Faster, verbose, and would not work without quotes
#
# #Trying to Print an Object 'bb' and Handling the Error, if thrown
tryCatch(print(bb), error = function(e) print(paste0(e)))
## [1] "Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'print': object 'bb' not found\n"

1.3 Data

Data are the facts and figures collected, analysed, and summarised for presentation and interpretation.

Elements are the entities on which data are collected. (Generally ROWS)

A variable is a characteristic of interest for the elements. (Generally COLUMNS)

The set of measurements obtained for a particular element is called an observation.

Statistics is the art and science of collecting, analysing, presenting, and interpreting data.

1.4 Vectors

R has 6 basic data types (logical, integer, double, character, complex, and raw). These data types can be combined to form data structures (vector, list, matrix, dataframe, factor, etc.). Refer to What is a Vector!

Definition 1.1 Vectors are the simplest type of data structure in R. A vector is a sequence of data elements of the same basic type.
Definition 1.2 Members of a vector are called components.

Atomic vectors are homogeneous i.e. each component has the same datatype. A vector type can be checked with the typeof() or class() function. Its length, i.e. the number of elements in the vector, can be checked with the function length().

If the output of an expression does not show numbers in brackets like ‘[1]’, then the return value is ‘NULL’; [numbers] show that a vector was returned. Ex: str() and cat() return ‘NULL’.
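A short sketch of the difference, using cat() and print() on illustrative strings:

```r
# #cat() prints to the console but returns NULL (invisibly), so no '[1]' appears
out <- cat("Hello World!\n")
is.null(out)
## [1] TRUE
#
# #print() returns its argument, which here is a length-1 character vector
out2 <- print("Hello World!")
## [1] "Hello World!"
identical(out2, "Hello World!")
## [1] TRUE
```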

Use function c() to create a vector (or a list) -

  • In R, a literal character or number is just a vector of length 1. So, c() ‘combines’ them together in a series of 1-length vectors.
  • c() neither creates nor concatenates vectors; it combines them. Thus, it combines lists into a list and vectors into a vector.
  • In R, list is a ‘Vector’ but not an ‘Atomic Vector.’
  • All arguments are coerced to a common type which is the type of the returned value.
  • All attributes (e.g. dim) except ‘names’ are removed.
  • The output type is determined from the highest type of the components in the hierarchy NULL < raw < logical < integer < double < complex < character < list < expression.
  • To “index a vector” means to address specific elements using square brackets, i.e. x[10] means the \({10^{th}}\) element of vector ‘x.’
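A few illustrative coercions up the hierarchy, plus an indexing example (the object name ‘x’ is arbitrary):

```r
# #Coercion moves UP the hierarchy to the highest type present
typeof(c(TRUE, 1L))    #logical + integer   -> integer
## [1] "integer"
typeof(c(1L, 2.5))     #integer + double    -> double
## [1] "double"
typeof(c(1, "a"))      #double + character  -> character
## [1] "character"
typeof(c(list(1), 2))  #list + double       -> list
## [1] "list"
#
# #Indexing with square brackets: the 10th element of 'x'
x <- 11:20
x[10]
## [1] 20
```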

Caution: Colon “:” might produce unexpected length of vectors (in case of 0-length vectors). Suggestion: Use colon only with hardcoded numbers i.e. “1:10” is ok, “1:n” is dangerous and should be avoided.

Caution: seq() function might produce unexpected type of vectors (in case of 1-length vectors). Suggestion: Use seq_along(), seq_len().
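Both cautions can be demonstrated with a 0-length case (the object names are illustrative):

```r
n <- 0
# #':' counts DOWNWARDS when n < 1, silently producing a 2-element vector
1:n
## [1] 1 0
#
# #seq_len() returns a 0-length vector, as intended
seq_len(n)
## integer(0)
#
# #seq_along() is the safe replacement for 1:length(x)
seq_along(character(0))
## integer(0)
```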

Atomic Vectors

# #To know about an Object: str(), class(), length(), dim(), typeof(), is(), attributes(), names()
# #Integer: To declare as integer "L" (NOT "l") is needed
ii_int <- c(1L, 2L, 3L, 4L, 5L)
str(ii_int)
##  int [1:5] 1 2 3 4 5
#
# #Double (& Default)
dd_dbl <- c(1, 2, 3, 4, 5)
str(dd_dbl)
##  num [1:5] 1 2 3 4 5
#
# #Character
cc_chr <- c('a', 'b', 'c', 'd', 'e')
str(cc_chr)
##  chr [1:5] "a" "b" "c" "d" "e"
#
# #Logical
ll_lgl <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
str(ll_lgl)
##  logi [1:5] TRUE FALSE FALSE TRUE TRUE

Integer

# #Integer Vector of Length 1
nn <- 5L
#
# #Colon ":" Operator - Avoid its usage
str(c(1:nn))
##  int [1:5] 1 2 3 4 5
c(typeof(pi:6), typeof(6:pi))
## [1] "double"  "integer"
#
# #seq() - Avoid its usage
str(seq(1, nn))
##  int [1:5] 1 2 3 4 5
str(seq(1, nn, 1))
##  num [1:5] 1 2 3 4 5
str(seq(1, nn, 1L))
##  num [1:5] 1 2 3 4 5
str(seq(1L, nn, 1L))
##  int [1:5] 1 2 3 4 5
#
# #seq_len()
str(seq_len(nn))
##  int [1:5] 1 2 3 4 5

Double

str(seq(1, 5, 1))
##  num [1:5] 1 2 3 4 5

Character

str(letters[1:5])
##  chr [1:5] "a" "b" "c" "d" "e"

Logical

str(1:5 %% 2 == 0)
##  logi [1:5] FALSE TRUE FALSE TRUE FALSE

1.5 DataFrame

# #Create Two Vectors
income <- c(100, 200, 300, 400, 500)
gender <- c("male", "female", "female", "female", "male")
#
# #Create a DataFrame
bb <- data.frame(income, gender)
#
# #Print or View DataFrame
#View(bb)
print(bb)
##   income gender
## 1    100   male
## 2    200 female
## 3    300 female
## 4    400 female
## 5    500   male
#
# #Structure
str(bb)
## 'data.frame':    5 obs. of  2 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
#
# #Names
names(bb)
## [1] "income" "gender"

1.6 Save and Load an R Script

R Script file extension is “.R”

"Ctrl+ S" will Open Save Window at Working Directory.

"Ctrl+ O" will Open the Browse Window at Working Directory.

Check File Exist

# #Subdirectory "data" has data files like .csv .rds .txt .xlsx
# #Subdirectory "code" has scripts files like .R 
# #Subdirectory "images" has images like .png
#
# #Check if a File exists 
path_relative <- "data/aa.xlsx" #Relative Path
#
if(file.exists(path_relative)) {
    cat("File Exists\n") 
  } else {
    cat(paste0("File does not exist at ", getwd(), "/", path_relative, "\n"))
  }
## File Exists
#
if(exists("XL", envir = .z)) {
  cat(paste0("Absolute Path exists as: ", .z$XL, "\n"))
  path_absolute <- paste0(.z$XL, "aa", ".xlsx") #Absolute Path
  #
  if(file.exists(path_absolute)) {
    cat("File Exists\n") 
  } else {
    cat(paste0("File does not exist at ", path_absolute, "\n"))
  }
} else {
  cat(paste0("Object 'XL' inside Hidden Environment '.z' does not exist. \n", 
             "It is probably File Path of the Author, Replace the File Path from Your own Directory\n"))
}
## Absolute Path exists as: D:/Analytics/xADSM/data/
## File Exists

Aside

  • This section is NOT useful for the general reader and can be safely ignored. It contains my notes related to building this book, useful only for someone building their own book. (Shivam)
  • “Absolute Path” is NOT a problem in Building a Book, Knitting a Chapter, or on Direct Console.
  • “Absolute Path” is a problem only when running a code chunk directly from the Rmd document and the Rmd document is inside a sub-directory (as in this book); only then does the Working Directory differ.

1.7 CSV Import /Export

write.csv() and read.csv() combination can be used to export data and import it back into R. But, it has some limitations -

  • Re-imported object “yy_data” will NOT match with the original object “xx_data” under default conditions
    1. write.csv(), by default, write row.names (or row numbers) in the first column.
      • So, either use row.names = FALSE while writing
      • OR use row.names = 1 while reading
    2. row.names attribute is always read as ‘character’ even though originally it might be ‘integer.’
      • So, that attribute needs to be coerced
    3. The colClasses argument needs to be specified to match the original dataframe; otherwise ‘income’ is read as ‘integer,’ even though originally it was ‘numeric.’
    4. Conclusion: Avoid, if possible.
  • Alternative: saveRDS() and readRDS()
    • Functions to write a single R object to a file, and to restore it.
    • Imported /Exported objects are always identical

ERROR 1.1 Error in file(file, ifelse(append, "a", "w")) : cannot open the connection
  • Check the path, file name, & file extension for typing mistakes
  • Execute getwd(), just before the command, to confirm that the working directory is as expected

write.csv()

str(bb)
## 'data.frame':    5 obs. of  2 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
#
xx_data <- bb
#
# #Write a dataframe to a CSV File
write.csv(xx_data, "data/B09_xx_data.csv")
#
# #Read from the CSV into a dataframe
yy_data <- read.csv("data/B09_xx_data.csv")
#
# #Check if the object being read is the same as the object that was written
identical(xx_data, yy_data)
## [1] FALSE

Match Objects

# #Exercise to show how to match the objects being imported /exported from CSV
str(bb)
## 'data.frame':    5 obs. of  2 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
xx_data <- bb
# #Write to CSV
write.csv(xx_data, "data/B09_xx_data.csv")
#
# #Read from CSV by providing row.names Column and colClasses()
yy_data <- read.csv("data/B09_xx_data.csv", row.names = 1,
                    colClasses = c('character', 'numeric', 'character'))
#
# #Coerce row.names attribute to integer
attr(yy_data, "row.names") <- as.integer(attr(yy_data, "row.names"))
#
# #Check if the objects are identical
identical(xx_data, yy_data)
## [1] TRUE
stopifnot(identical(xx_data, yy_data))

RDS

str(bb)
## 'data.frame':    5 obs. of  2 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
xx_data <- bb
#
# #Save the Object as RDS File
saveRDS(xx_data, "data/B09_xx_data.rds")
#
# #Read from the RDS File
yy_data <- readRDS("data/B09_xx_data.rds")
#
# #Objects are identical (No additional transformations are needed)
identical(xx_data, yy_data)
## [1] TRUE

1.8 Modify Dataframe

str(xx_data)
## 'data.frame':    5 obs. of  2 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
# #Adding a Column to a dataframe
xx_data <- data.frame(xx_data, age = 22:26)
#
# #Adding a Column to a dataframe by adding a Vector
x_age <- 22:26
xx_data <- data.frame(xx_data, x_age)
str(xx_data)
## 'data.frame':    5 obs. of  4 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
##  $ age   : int  22 23 24 25 26
##  $ x_age : int  22 23 24 25 26
#
# #Adding a Column to a dataframe by using dollar "$"
xx_data$age1 <- x_age
#
# #Adding a Blank Column using NA
xx_data$blank <- NA
#
# #Editing of a dataframe can also be done
# edit(xx_data)
str(xx_data)
## 'data.frame':    5 obs. of  6 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
##  $ age   : int  22 23 24 25 26
##  $ x_age : int  22 23 24 25 26
##  $ age1  : int  22 23 24 25 26
##  $ blank : logi  NA NA NA NA NA
#
# #Removing a Column by subsetting
xx_data <- xx_data[ , -c(3)]
#
# #Removing a Column using NULL
xx_data$age1 <- NULL
str(xx_data)
## 'data.frame':    5 obs. of  4 variables:
##  $ income: num  100 200 300 400 500
##  $ gender: chr  "male" "female" "female" "female" ...
##  $ x_age : int  22 23 24 25 26
##  $ blank : logi  NA NA NA NA NA

1.9 Packages

Definition 1.3 Packages are the fundamental units of reproducible R code.

Packages include reusable functions, the documentation that describes how to use them, and sample data.

In R Studio: Packages Tab | Install | Package Name = “psych” | Install

  • Packages are installed from CRAN Servers
    • To Change Server: Tools | Global Options | Packages | Primary CRAN Repository | Change | CRAN Mirrors (Select Your Preference) | OK
    • All Installed Packages are listed under Packages Tab
    • All Loaded Packages are listed under Packages Tab with a Tick Mark
    • Some packages are dependent on other packages and those are also installed when ‘dependencies = TRUE’
    • If a package is NOT installed properly, it will show error when loaded by library()
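A common pattern to check for a package without attaching it; ‘psych’ here is just the example package from above:

```r
# #requireNamespace() checks whether a package is installed WITHOUT attaching it
if (requireNamespace("psych", quietly = TRUE)) {
  cat("Package 'psych' is installed\n")
} else {
  cat("Package 'psych' is NOT installed; install it with install.packages()\n")
}
```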

Install Packages

if(FALSE){
  # #WARNING: This will install packages and R Studio will NOT work for that duration
  # #Install Packages and their dependencies
  install.packages("psych", dependencies = TRUE)
}

Load Packages

# #Load a Package with or without Quotes
library(readxl)
library("readr")

Load Multiple Packages

# #Load Multiple Packages
pkg_chr <- c("ggplot2", "tibble", "tidyr", "readr", "dplyr")
#lapply(pkg_chr, FUN = function(x) {library(x, character.only = TRUE)})
#
# #Load Multiple Packages, Suppress Startup Messages, and No console output
invisible(lapply(pkg_chr, FUN = function(x) {
  suppressMessages(library(x, character.only = TRUE))}))

Detach Package

# #Detach a package
#detach("package:psych", unload = TRUE)
#
# #Search Package in the already loaded packages
pkg_chr <- "psych"
if (pkg_chr %in% .packages()) {
# #Detach a package that has been loaded previously
  detach(paste0("package:", pkg_chr), character.only = TRUE, unload = TRUE)
}

Install Older Version of Package

# #When Update of a Package breaks your code and you want to postpone that debugging
# #Get the URL of older version of the Package from CRAN
packageurl <- "https://cran.r-project.org/src/contrib/Archive/latex2exp/latex2exp_0.5.0.tar.gz"
#
if(FALSE) {# #WARNING: Installation may take some time.
  install.packages(packageurl, repos = NULL, type = "source")  
}

Package Version

packageVersion("dplyr")
## [1] '1.0.7'
packageVersion("latex2exp")
## [1] '0.9.3'

Rtools4 on Windows

Caution: It is NOT recommended. However, the instructions are available at Rtools4

# #Only for the debugging purposes, install from GitHub.
if(FALSE) {# #WARNING: Installation may take some time.
  Sys.which("make") #"D:\\Installations\\rtools40\\usr\\bin\\make.exe"
  devtools::install_github("stefano-meschiari/latex2exp", ref = "0.9.3") 
}

Update Packages

  • Update Packages from RStudio in general
  • However, there are Packages that come together with R. Those need to be updated with each R update
    • Run R as an administrator.
    • Packages | Update Packages
      • If it asks for “Do you want to use Personal Directory” - Decline. It is another headache.
      • If Some Packages fail to update in Administrator Mode, Rerun the update one by one. It works.

1.10 Import Flights Data

To Import Excel in R Studio : Environment | Dropdown | From Excel | Browse

The object imported by read.csv() i.e. ‘mydata’ is NOT the same as the one imported by read_excel() i.e. ‘mydata_xl’

  • read_excel() imports as a Tibble, which is a modern take on the dataframe. It is more restrictive so that outputs are more predictable.
  • read.csv() imports as ‘integer’ where possible (ex: the ‘year’ column), whereas read_excel() imports numbers as ‘numeric’ where possible.
  • Further, read_excel() has imported many columns as ‘character’ that should have been ‘numeric,’ ex: dep_time
  • NOTE: To complete the set, readr::read_csv() is also covered here; it reads a CSV and generates a Tibble.

All of these objects can be converted into any other form as needed i.e. dataframe to tibble or vice-versa.
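A minimal sketch of such a conversion, using tibble::as_tibble() on a toy dataframe (the tibble package is loaded earlier in these notes):

```r
# #A dataframe can be promoted to a tibble and demoted back
df <- data.frame(a = 1:3, b = letters[1:3])
#
# #dataframe to tibble
tb <- tibble::as_tibble(df)
class(tb)
## [1] "tbl_df"     "tbl"        "data.frame"
#
# #tibble back to a plain dataframe
df2 <- as.data.frame(tb)
class(df2)
## [1] "data.frame"
```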

Flights

# #Data File Name has been modified to include lecture number "B09"
# #All Data Files are in the sub-directory named 'data'
mydata <- read.csv("data/B09-FLIGHTS.csv")
#
# #To Copy from Clipboard, assuming copied from xlsx i.e. tab separated data
mydata_clip <- read.csv("clipboard", sep = '\t', header = TRUE)

RDS

# #Following Setup allows us to read CSV only once and then create an RDS file
# #Its advantage is in terms of faster loading time and lower memory requirement
xx_csv <- paste0("data/", "B09-FLIGHTS", ".csv")
xx_rds <- paste0("data/", "b09_flights", ".rds")
b09_flights <- NULL
if(file.exists(xx_rds)) {
  b09_flights <- readRDS(xx_rds)
} else {
  # #Read CSV
  b09_flights <- read.csv(xx_csv)
  # #Write Object as RDS
  saveRDS(b09_flights, xx_rds)
}
rm(xx_csv, xx_rds)
mydata <- b09_flights

Structure

str(mydata)
## 'data.frame':    336776 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : int  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : int  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : int  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : int  1400 1416 1089 1576 762 719 1065 229 944 733 ...
##  $ hour          : int  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : int  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : chr  "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

Tail

tail(mydata)
##        year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## 336771 2013     9  30       NA           1842        NA       NA           2019        NA      EV
## 336772 2013     9  30       NA           1455        NA       NA           1634        NA      9E
## 336773 2013     9  30       NA           2200        NA       NA           2312        NA      9E
## 336774 2013     9  30       NA           1210        NA       NA           1330        NA      MQ
## 336775 2013     9  30       NA           1159        NA       NA           1344        NA      MQ
## 336776 2013     9  30       NA            840        NA       NA           1020        NA      MQ
##        flight tailnum origin dest air_time distance hour minute           time_hour
## 336771   5274  N740EV    LGA  BNA       NA      764   18     42 2013-09-30 18:00:00
## 336772   3393    <NA>    JFK  DCA       NA      213   14     55 2013-09-30 14:00:00
## 336773   3525    <NA>    LGA  SYR       NA      198   22      0 2013-09-30 22:00:00
## 336774   3461  N535MQ    LGA  BNA       NA      764   12     10 2013-09-30 12:00:00
## 336775   3572  N511MQ    LGA  CLE       NA      419   11     59 2013-09-30 11:00:00
## 336776   3531  N839MQ    LGA  RDU       NA      431    8     40 2013-09-30 08:00:00

Excel

# #library(readxl)
mydata_xl <- read_excel("data/B09-FLIGHTS.xlsx", sheet = "FLIGHTS")

Excel RDS

# #library(readxl)
xx_xl <- paste0("data/", "B09-FLIGHTS", ".xlsx")
xx_rds_xl <- paste0("data/", "b09_flights_xls", ".rds")
b09_flights_xls <- NULL
if(file.exists(xx_rds_xl)) {
  b09_flights_xls <- readRDS(xx_rds_xl)
} else {
  b09_flights_xls <- read_excel(xx_xl, sheet = "FLIGHTS")
  saveRDS(b09_flights_xls, xx_rds_xl)
}
rm(xx_xl, xx_rds_xl)
mydata_xl <- b09_flights_xls
#

xlsx

str(mydata_xl)
## tibble [336,776 x 19] (S3: tbl_df/tbl/data.frame)
##  $ year          : num [1:336776] 2013 2013 2013 2013 2013 ...
##  $ month         : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : chr [1:336776] "517" "533" "542" "544" ...
##  $ sched_dep_time: num [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : chr [1:336776] "2" "4" "2" "-1" ...
##  $ arr_time      : chr [1:336776] "830" "850" "923" "1004" ...
##  $ sched_arr_time: num [1:336776] 819 830 850 1022 837 ...
##  $ arr_delay     : chr [1:336776] "11" "20" "33" "-18" ...
##  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
##  $ flight        : num [1:336776] 1545 1714 1141 725 461 ...
##  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : chr [1:336776] "227" "227" "160" "183" ...
##  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
##  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

readr

# #Following Setup allows us to read CSV only once and then create an RDS file
# #Its advantage is in terms of faster loading time and lower memory requirement
# #library(readr)
xx_csv <- paste0("data/", "B09-FLIGHTS", ".csv")
xx_rds <- paste0("data/", "xxflights", ".rds")
xxflights <- NULL
if(file.exists(xx_rds)) {
  xxflights <- readRDS(xx_rds)
} else {
  xxflights <- read_csv(xx_csv, show_col_types = FALSE)
  attr(xxflights, "spec") <- NULL
  attr(xxflights, "problems") <- NULL
  saveRDS(xxflights, xx_rds)
}
rm(xx_csv, xx_rds)
mydata_rdr <- xxflights

1.11 Subsetting

# #Subset All Rows and last 3 columns
data6 <- mydata[ , c(17:19)]
str(data6)
## 'data.frame':    336776 obs. of  3 variables:
##  $ hour     : int  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute   : int  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour: chr  "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
# #Subset by deleting the 1:16 columns
data7 <- mydata[ , -c(1:16)]
stopifnot(identical(data6, data7))

1.12 Attach a Dataset

Caution: Attaching a Dataset should be avoided to prevent unexpected behaviour due to ‘masking.’ Using full scope resolution i.e. ‘data_frame$column_header’ results in fewer bugs. However, if a Dataset has been attached, please ensure that it is also detached.

Caution: If a dataset is attached more than once e.g. 4 times, there will be 4 copies attached to the environment. This can be checked with search(). Each copy needs to be detached separately.

if(FALSE){
  # #WARNING: Attaching a Dataset is discouraged because of 'masking'
  # #'dep_time' is Column Header of a dataframe 'mydata'
  tryCatch(str(dep_time), error = function(e) print(paste0(e)))
## [1] "Error in str(dep_time): object 'dep_time' not found\n"
  # #Attach the Dataset
  attach(mydata)
  # #Now all the column headers are accessible without the $ sign
  str(dep_time)
## int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
  # #But, there are other datasets also, attaching another one results in MESSAGE
  attach(mydata_xl)
## The following objects are masked from mydata:
##
##     air_time, arr_delay, arr_time, carrier, day, dep_delay, dep_time, dest,
##     distance, flight, hour, minute, month, origin, sched_arr_time,
##     sched_dep_time, tailnum, time_hour, year
  str(dep_time)
## chr [1:336776] "517" "533" "542" "544" "554" "554" "555" "557" "557" "558" "558" ...
#
# #'mydata_xl$dep_time' masked the already present 'mydata$dep_time'.
# #Thus now it is showing as 'chr' in place of original 'int'
# #Column Header Names can be highly varied and those will silently mask other variable
# #Hence, attaching a dataset would result in random bugs or unexpected behaviours
#
# #Detach a Dataset
  detach(mydata_xl)
  detach(mydata)
}

1.13 Package “psych”

  • pairs.panels() -
    • It shows a scatter plot of matrices, with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal.
      • Calculation time is highly dependent on dataset size and type
      • See Figure 1.1
      • Conclusion: “air_time and distance are highly correlated”
ERROR 1.2 Error in plot.window(...) : need finite ’xlim’ values
  • In this case, the error is observed if the output of pairs.panels() is assigned to an object.
  • Direct console output (i.e. no assignment) is not a problem
ERROR 1.3 Error in par(old.par) : invalid value specified for graphical parameter "pin"
  • The Error is generally observed because the Plot does not have enough space in the RStudio Plots pane (Lower Right). In general, it is NOT a problem: internally xlim[1] must be less than xlim[2], and this check fails when the drawing region is too small.
  • Use a larger window size or control the image output size

Image

Figure 1.1 (B09P01) Correlation using psych::pairs.panels()

Code

# # Subset 3 Columns and 10,000 rows 
x_rows <- 10000L
data_pairs <- mydata[1:x_rows, c(7, 16, 9)]
#
# #Equivalent
ii <- mydata %>%
  select(air_time, distance, arr_delay) %>%
  slice_head(n = x_rows)
#
stopifnot(identical(ii, data_pairs))
#
if( nrow(data_pairs) * ncol(data_pairs) > 1000000 ) {
  print("Please reduce the number of points to a sane number!")
  ggplot()
} else {
  #B09P01
  pairs.panels(data_pairs)
  if(FALSE){# Cleaner Graph
    pairs.panels(data_pairs, smooth = FALSE, jiggle = FALSE, rug = FALSE, ellipses = FALSE, 
                 cex.cor = 1, cex = 1, gap = 0, main = "Title")
    title(sub = "Caption", line = 4, adj = 1)   
  }
}

1.14 Operators in R

  • There are multiple infix binary operators
  • a %in% b : returns a logical vector indicating, for each element of the left operand, whether there is a match in the right operand
  • %/% and %% perform integer division and modulo division respectively
  • %o% gives the outer product of arrays
  • %*% performs matrix multiplication
  • %x% performs the Kronecker product of arrays
  • The magrittr pipe is given by %>%
  • There are other pipes in other packages as well, but this is a general summary
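A minimal sketch of these operators (base R semantics; the pipe line is commented out because it needs magrittr or dplyr loaded):

```r
# #%in% : logical vector, one element per item of the LEFT operand
c(2, 5) %in% c(1, 2, 3)                 # TRUE FALSE
# #%/% integer division and %% modulo
7 %/% 2                                 # 3
7 %% 2                                  # 1
# #%o% outer product: length-2 by length-3 gives a 2 x 3 matrix
dim(1:2 %o% 1:3)                        # 2 3
# #%*% matrix multiplication: (2 x 3) %*% (3 x 2) gives a 2 x 2 matrix
dim(matrix(1:6, nrow = 2) %*% matrix(1:6, nrow = 3))  # 2 2
# #%x% Kronecker product: (2 x 2) %x% (2 x 2) gives a 4 x 4 matrix
dim(diag(2) %x% diag(2))                # 4 4
# #magrittr pipe
# c(1, 4, 9) %>% sqrt()                 # 1 2 3
```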

1.15 Printing Decimal Numbers in R

  • By default, R prints numbers in Scientific Notation ('E')
    • Personally, it is irritating to read p-values, residuals etc. in model output and to convert them everywhere.
    • Defaults can be changed so that Scientific Notation is disabled
    • Functions round(), signif() do not have an 'E' option
    • Functions sprintf(), prettyNum(), format() convert to 'character'
      • And there is formatC(), which is not affected by global options()
Definition 1.4 Rounding means replacing a number with an approximate value that has a shorter, simpler, or more explicit representation.
Definition 1.5 Significant digits, (or significant figures, or precision or resolution), of a number in positional notation are digits in the number that are reliable and necessary to indicate the quantity of something.
  • Significant digits
    • Even when some of the digits are not certain, as long as they are reliable, they are considered significant because they indicate the actual volume within the acceptable degree of uncertainty.
    • Not Significant digits
      • Leading Zeroes e.g. 013 kg or 0.056 m (= 56 mm) both have 2 significant digits
      • Trailing zeros when they are merely placeholders

Change Defaults

if(FALSE) {# #Disable Scientific Notation 'E' in R. Run Once for current session. 
# #Put it in .Rprofile File to always execute this.  
  options(scipen = 999)
}
#
if(FALSE) {# #To revert back to original defaults
  options(scipen = 0, digits = 7)
}

Numerical Printing

# #This change in options has NOT been executed to prevent any unintended consequences
if(FALSE) options(scipen = 0, digits = 7) #From the Disabled 'E' back to Default for testing
# #Sequence of Powers of 10
ii <- 10^(2:-4) 
jj <- abs(ii - 1L)
#
round(ii, 2)  #1e+02 1e+01 1e+00 1e-01 1e-02 0e+00 0e+00
signif(ii, 2) #1e+02 1e+01 1e+00 1e-01 1e-02 1e-03 1e-04
#
# #Following are better BUT print "character" not "numeric" i.e. suitable for final printing only
#"100"    "10"     "1"      "0.1"    "0.01"   "0.001"  "0.0001"
prettyNum(ii, scientific = FALSE)
format(ii, scientific = FALSE, drop0trailing = TRUE, trim = TRUE)
#
if(FALSE) options(scipen = 999) #From Default to the Disabled 'E'
#
# #round() does not distinguish between 0.001 & 0.0001 and always prints specified decimal places
round(ii, 2)  #100.00  10.00   1.00   0.10   0.01   0.00   0.00
# #signif() handles significant digits but prints trailing zeroes depending upon lowest value
signif(ii, 2) #100.0000  10.0000   1.0000   0.1000   0.0100   0.0010   0.0001
#
round(jj, 2)
signif(jj, 2) #Rounds 0.999 to 1
#
# #Printing "character" and digits option is for rounding
#"100"    "10"     "1"      "0.1"    "0.01"   "0.001"  "0.0001"
prettyNum(ii)
format(ii, drop0trailing = TRUE, trim = TRUE)
#
prettyNum(jj)
format(jj, drop0trailing = TRUE, trim = TRUE, digits = 2) #Rounds 0.999 to 1
#
# #formatC() does not get affected by change in options(). This is undesirable.
formatC(ii, digits = 2)

1.16 Find Datasets

# #List All Datasets of a Loaded Package
data(package = "ggplot2")$results[ , "Item"]
##  [1] "diamonds"       "economics"      "economics_long" "faithfuld"      "luv_colours"   
##  [6] "midwest"        "mpg"            "msleep"         "presidential"   "seals"         
## [11] "txhousing"
data(package = "nycflights13")$results[ , "Item"]
## [1] "airlines" "airports" "flights"  "planes"   "weather"

Validation


2 R Introduction (B10, Sep-05)

2.2 Notebooks

These allow you to combine executable code and rich text in a single document, along with images, HTML, LaTeX and more.

Definition 2.1 R Markdown is a file format for making dynamic documents with R.

An R Markdown document is written in markdown (an easy-to-write plain text format) and contains chunks of embedded R code. To know more Go to Rstudio

To know more about Google Colab Go to Google Colab

NOTE: As I am not using Google Colab, the workflow explained between 00:00 to 35:10 is NOT covered here. If someone is using Google Colab, and is willing to share their notes, I would include those.

2.3 Plot

Base R graphs/plots are shown in Figure 2.1

ERROR 2.1 Error in plot.window(...) : need finite ’xlim’ values
  • If this error is coming when base R plot() function is called
  • Check if data has NA values or if character data is supplied where numerical is needed
  • Also, do not use assignment to save base R plot and then print
Figure 2.1 (B10P01) Flights: Arrival Time (Y) vs. Departure Time (X)

2.4 Dataset

  • Use cbind() or rbind() to merge dataframes

Dimensions

# #Create a Subset of Dataframe of 1000 Rows for quick calculations
bb <- head(mydata, 1000)
#
# #Dimensions: dim() Row x Column; nrow(); ncol()
dim(bb)
## [1] 1000   19
#
stopifnot(identical(nrow(bb), dim(bb)[1]))
stopifnot(identical(ncol(bb), dim(bb)[2]))

Split

# #Split a Dataframe by subsetting
data_1 <- bb[ , 1:8]
data_2 <- bb[ , 9:19]
# str(data_1)

Merge

# #Merge a Dataframe by cbind()
data_3 <- cbind(data_1, data_2)
# #Equivalent
data_4 <- data.frame(data_1, data_2)
# str(data_3)
stopifnot(identical(data_3, data_4))

RowSplit

# #Row Split
data_5 <- bb[1:300, ]
data_6 <- bb[301:1000, ]
#
# #Equivalent
n_rows <- 300L
data_5 <- bb[1:n_rows, ]
data_6 <- bb[(n_rows + 1L):nrow(bb), ]
#
stopifnot(identical(data_5, head(bb, n_rows)))
stopifnot(identical(data_6, tail(bb, (nrow(bb) - n_rows))))

RowMerge

# #Merge a Dataframe by rbind()
data_7 <- rbind(data_5, data_6)
stopifnot(identical(bb, data_7))

2.5 Change Column Headers

# #Change A Specific Name based on Index Ex: First Header "year" -> "YEAR"
# #NOTE: Output of 'names(bb)' is a character vector, not a dataframe
# #So, [1] is being used to subset for 1st element and NOT the [ , 1] (as done for dataframe)
(names(bb)[1] <- "YEAR")
## [1] "YEAR"
#
# #Change all Column Headers to Uppercase by toupper() or Lowercase by tolower()
names(bb) <- toupper(names(bb))

2.6 NA

Definition 2.2 NA is a logical constant of length 1 which contains a missing value indicator.

NA can be coerced to any other vector type except raw. There are also typed constants like NA_integer_, NA_real_ etc. For checking merely the presence of NA, anyNA(x) is faster than any(is.na(x)).

Overview of ‘Not Available’

  • If the imported data has blank cells, they are imported as NA

To remove all NA

  • na.omit()
    • Output is a dataframe
    • It is slower but adds the omitted row numbers as an attribute i.e. na.action
  • complete.cases()
    • Output is a logical vector, thus it needs subsetting to get the dataframe
    • Faster and also allows partial selection of columns i.e. ignore NA in other columns
    • Caution: It may throw Error if ‘POSIXlt’ Columns are present
  • tidyr::drop_na()
  • rowSums(is.na())
    • It can also be used for excluding rows with more than an allowed number of NA. However, in general, this is not recommended because random columns retain NA. These may break the code later or change the number of observations. It is useful when all columns are similar in nature e.g. if each column represents a response to a survey question.

NA

bb <- xxflights
# #anyNA() is faster than any(is.na())
if(anyNA(bb)) print("NA are Present!") else print("NA not found")
## [1] "NA are Present!"
#
# #Columnwise NA Count
bb_na_col <- colSums(is.na(bb))
# #
bb %>% summarise(across(everything(), ~ sum(is.na(.)))) %>% 
  pivot_longer(everything()) %>% filter(value > 0)
## # A tibble: 6 x 2
##   name      value
##   <chr>     <int>
## 1 dep_time   8255
## 2 dep_delay  8255
## 3 arr_time   8713
## 4 arr_delay  9430
## 5 tailnum    2512
## 6 air_time   9430
#
colSums(is.na(bb)) %>% as_tibble(rownames = "Cols") %>% filter(value > 0)
## # A tibble: 6 x 2
##   Cols      value
##   <chr>     <dbl>
## 1 dep_time   8255
## 2 dep_delay  8255
## 3 arr_time   8713
## 4 arr_delay  9430
## 5 tailnum    2512
## 6 air_time   9430
#
# #Vector of Columns having NA
which(bb_na_col != 0)
##  dep_time dep_delay  arr_time arr_delay   tailnum  air_time 
##         4         6         7         9        12        15
stopifnot(identical(which(bb_na_col != 0), which(vapply(bb, anyNA, logical(1)))))
#
# #Indices of Rows with NA
head(which(!complete.cases(bb)))
## [1] 472 478 616 644 726 734
#
# #How many rows contain NA
sum(!complete.cases(bb))
## [1] 9430
#
# #How many rows have NA in specific Columns
sum(!complete.cases(bb[ , c(6, 9, 4)]))
## [1] 9430

RemoveNA

# #Remove all rows which have any NA 
# #na.omit(), complete.cases(), tidyr::drop_na(), rowSums(is.na())
bb_1 <- na.omit(bb)
# #Print the Count of removed rows containing NA
print(paste0("Note: ", length(attributes(bb_1)$na.action), " rows removed."))
## [1] "Note: 9430 rows removed."
#
# #Remove additional Attribute added by na.omit()
attr(bb_1, "na.action") <- NULL
#
# #Equivalent 
bb_2 <- bb[complete.cases(bb), ]
bb_3 <- bb %>% drop_na()
bb_4 <- bb[rowSums(is.na(bb)) == 0, ]
#Validation
stopifnot(all(identical(bb_1, bb_2), identical(bb_1, bb_3), identical(bb_1, bb_4)))
#
# #complete.cases also allow partial selection of specific columns
# #Remove rows which have NA in some columns i.e. ignore NA in other columns
dim(bb[complete.cases(bb[ , c(6, 9, 4)]), ])
## [1] 327346     19
# #Equivalent 
dim(bb %>% drop_na(dep_delay, arr_delay, dep_time))
## [1] 327346     19
#
# #Remove rows which have more than allowed number of NA (ex:4) in any column
# #Caution: In general, this is not recommended because random columns retain NA
dim(bb[rowSums(is.na(bb)) <= 4L, ])
## [1] 328521     19

2.7 Apply

Sources: (SO) Grouping Functions and the Apply Family, (SO) Why is vapply safer than sapply, Hadley - Advanced R - Functionals, This, This, & This

The apply functions in R are designed to avoid explicit use of loop constructs.

  • To manipulate slices of data in a repetitive way.
  • They act on an input list, matrix or array, and apply a named function with one or several optional arguments.
  1. apply(X, MARGIN, FUN, ..., simplify = TRUE)
    • Refer R Manual p72 - “Apply Functions Over Array Margins”
    • Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.
    • When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first
    • MARGIN = 1 indicates application over ROWS, 2 indicates COLUMNS
    • Examples & Details: “ForLater”
  2. lapply(X, FUN, ...)
    • Refer R Manual p342 - “Apply a Function over a List or Vector”
    • ‘list’ apply i.e. lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
    • Examples & Details: “ForLater”
    • When you want to apply a function to each element of a list in turn and get a list back.
    • lapply(x, mean)
    • lapply(x, function(x) c(mean(x), sd(x)))
  3. sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
    • ‘simplified’ wrapper of lapply
    • When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.
    • Caution: It sometimes fails silently or unexpectedly changes output type
  4. vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
    • ‘verified’ apply i.e. vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.
    • vapply returns a vector or array of type matching the FUN.VALUE.
    • With FUN.VALUE you can specify the type and length of the output that should be returned each time your applied function is called.
    • It improves consistency by providing limited return type checks.
    • Further, if the input length is zero, sapply will always return an empty list, regardless of the input type (Thus behaving differently from non-zero length input). Whereas, with vapply, you are guaranteed to have a particular type of output, so you do not need to write extra checks for zero length inputs.
  5. Others - “ForLater”
    • tapply is a tagged apply where the tags identify the subsets
    • mapply for applying a function to multiple arguments
    • rapply for a ‘recursive’ version of lapply
    • eapply for applying a function to each entry in an ‘environment’
# #Subset Dataframe 
bb <- xxflights
data_8 <- bb[ , c("dep_delay", "arr_delay", "dep_time")]
#data_8 <- bb %>% select(dep_delay, arr_delay, dep_time) 
#
# #Remove missing values
data_9 <- na.omit(data_8)
#
# #Calculate Columnwise Mean
(bb_1 <- apply(data_9, 2, mean))
##   dep_delay   arr_delay    dep_time 
##   12.555156    6.895377 1348.789883
bb_2 <- unlist(lapply(data_9, mean))
bb_3 <- sapply(data_9, mean)
bb_4 <- vapply(data_9, mean, numeric(1))
#
stopifnot(all(identical(bb_1, bb_2), identical(bb_1, bb_3), identical(bb_1, bb_4)))
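The type-safety and zero-length behaviour described above can be sketched as follows (hypothetical toy inputs, not the flights data):

```r
# #With zero-length input, sapply() falls back to an empty list
str(sapply(integer(0), function(x) x + 1))
##  list()
# #vapply() guarantees the declared type even for zero-length input
str(vapply(integer(0), function(x) x + 1, numeric(1)))
##  num(0)
# #sapply() silently changes shape when FUN returns length 2; vapply() throws an ERROR
aa <- list(1:2, 3:5)
sapply(aa, range)   # a 2 x 2 matrix, not a vector
tryCatch(vapply(aa, range, numeric(1)),
         error = function(e) print("vapply caught the type/length mismatch"))
```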

2.8 Vectors

Refer The 6 Datatypes of Atomic Vectors

Create a Basic Tibble, Table 2.1, for evaluating the ‘is.x()’ series of functions in Base R

  • anyNA() is TRUE if there is an NA present, FALSE otherwise
  • is.atomic() is TRUE for All Atomic Vectors, factor, matrix but NOT for list
  • is.vector() is TRUE for All Atomic Vectors, list but NOT for factor, matrix, DATE & POSIXct
    • Caution: With vapply() it returns TRUE for matrix (it checks individual elements)
    • Caution: FALSE if the vector has attributes (except names) ex: DATE & POSIXct
  • is.numeric() is TRUE for both integer and double
  • is.integer(), is.double(), is.character(), is.logical() are TRUE for their respective datatypes only
  • is.factor(), is.ordered() are membership functions for factors with or without ordering
    • For more: nlevels(), levels()
  • lubridate
    • is.timepoint() is TRUE for POSIXct, POSIXlt, or Date
    • is.POSIXt(), is.Date() are TRUE for their respective datatypes only
Table 2.1: (B10T01) Vector Classes

  ii dd cc    ll   ff   fo                 dtm        dat
1  1  1  a FALSE  odd  odd 2022-02-08 23:47:05 2022-02-09
2  2  2  b  TRUE even even 2022-02-08 23:47:06 2022-02-10
3  3  3  c FALSE  odd  odd 2022-02-08 23:47:07 2022-02-11
4  4  4  d  TRUE even even 2022-02-08 23:47:08 2022-02-12
5  5  5  e FALSE  odd  odd 2022-02-08 23:47:09 2022-02-13
6  6  6  f  TRUE even even 2022-02-08 23:47:10 2022-02-14

Basic Tibble

# #Basic Tibble
nn <- 6L
xxbasic10 <- tibble(ii = 1:nn, dd = seq(1, nn, 1), cc = head(letters, nn), 
             ll = (ii %% 2) == 0, ff = factor(rep(c("odd", "even"), length.out = nn)),
             fo = factor(rep(c("odd", "even"), length.out = nn), ordered = TRUE),
             dtm = Sys.time() + 1:nn, dat = Sys.Date() + 1:nn)
bb <- xxbasic10
str(bb)

is

# #Validation
# #anyNA() is TRUE if there is an NA present, FALSE otherwise
vapply(bb, anyNA, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.atomic() is TRUE for All Atomic Vectors, factor, matrix but NOT for list
vapply(bb, is.atomic, logical(1))
##   ii   dd   cc   ll   ff   fo  dtm  dat 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#
# #is.vector() is TRUE for All Atomic Vectors, list but NOT for factor, matrix, DATE & POSIXct
# #CAUTION: With vapply() it returns TRUE for matrix (it checks individual elements)
# #CAUTION: FALSE if the vector has attributes (except names) ex: DATE & POSIXct
vapply(bb, is.vector, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
##  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
#
# #is.numeric() is TRUE for both integer and double
vapply(bb, is.numeric, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
##  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.integer() is TRUE only for integer
vapply(bb, is.integer, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
##  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.double() is TRUE only for double
vapply(bb, is.double, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
#
# #is.character() is TRUE only for character
vapply(bb, is.character, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
#
# #is.logical() is TRUE only for logical
vapply(bb, is.logical, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

Factor

# #Factors
# #is.factor() is TRUE only for factor (unordered or ordered)
vapply(bb, is.factor, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
#
# #is.ordered() is TRUE only for ordered factor
vapply(bb, is.ordered, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
#
# #nlevels()
vapply(bb, nlevels, integer(1))
##  ii  dd  cc  ll  ff  fo dtm dat 
##   0   0   0   0   2   2   0   0
#
# #levels()
vapply(bb, function(x) !is.null(levels(x)), logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
#
# #table()
table(bb$ff)
## 
## even  odd 
##    3    3

lubridate::is

# #Package lubridate covers the missing functions for POSIXct, POSIXlt, or Date 
# #is.timepoint() is TRUE for POSIXct, POSIXlt, or Date
vapply(bb, is.timepoint, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
#
# #is.POSIXt() is TRUE only for POSIXct 
vapply(bb, is.POSIXt, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#
# #is.Date() is only TRUE for DATE 
vapply(bb, is.Date, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

Duplicates

# #Which Columns have Duplicate Values
vapply(bb, function(x) anyDuplicated(x) != 0L, logical(1))
##    ii    dd    cc    ll    ff    fo   dtm   dat 
## FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE

2.9 Factors

Definition 2.3 Factors are the data objects which are used to categorize the data and store it as levels.

They can store both strings and integers. They are useful in the columns which have a limited number of unique values. Like “Male, Female” and “True, False” etc. They are useful in data analysis for statistical modelling.

A factor is, internally, nothing but a numeric (integer) representation of a character vector.

as.factor() vs. factor()

  • as.factor() is faster than factor() when input is a factor or integer
  • as.factor retains unused or NA levels whereas factor drops them
    • levels can also be dropped using droplevels()
  • (SO) Levels vs. Labels
    • Levels are Input, Labels are Output in factor().
    • There is only ‘level’ attribute, no ‘label’ attribute.
    • In R (unlike SPSS) there is NO difference between what is stored and what is displayed. As soon as the levels (“papaya,” “banana”) are given labels (“pink,” “black”), there is NO way to get back the original levels.
    • (Aside) This misconception persists because generally we change between numerical to factor or binary character to factor. In these situations generally we know which level shows what and edit /sort them immediately.
  • Caution: It is highly advised to specify ‘levels’ explicitly when creating factors, to keep control of their ordering rather than depending upon their order of occurrence in the vector.
    • Further, this helps while relabeling them because the order is known.
    • Using the ‘forcats’ or ‘car’ packages it can (probably) be done for the default situation, by sorting factor levels by their integer equivalent before assigning new labels to them. However, this is low priority for now.
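A minimal sketch of the level-dropping behaviour noted above (toy vector, not the flights data):

```r
# #A factor keeps unused levels after subsetting
yy <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))
yy_sub <- yy[yy != "b"]
levels(yy_sub)               # "a" "b" "c" : "b" and "c" are retained although unused
# #factor() recomputes the level set from the data i.e. drops unused levels
levels(factor(yy_sub))       # "a"
# #as.factor() on a factor is a no-op, so all levels stay
levels(as.factor(yy_sub))    # "a" "b" "c"
# #droplevels() drops unused levels explicitly
levels(droplevels(yy_sub))   # "a"
```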

Transformation

str(bb$ll)
##  logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
# #Coercion to Factor
bb$new <- as.factor(bb$ll)
str(bb$new)
##  Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 2
#
# #table()
table(bb$ll)
## 
## FALSE  TRUE 
##     3     3
table(bb$new)
## 
## FALSE  TRUE 
##     3     3
#
# #Levels can be Labelled differently also
str(bb$ff)
##  Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
# # 
str(factor(bb$ff, levels = c("even", "odd"), labels = c("day", "night")))
##  Factor w/ 2 levels "day","night": 2 1 2 1 2 1
str(factor(bb$ff, levels = c("odd", "even"), labels = c("day", "night")))
##  Factor w/ 2 levels "day","night": 1 2 1 2 1 2
#
# #Coercion from Factor to character, logical etc.
bb$xcc <- as.character(bb$new)
bb$xll <- as.logical(bb$new)
#
str(bb)
## tibble [6 x 11] (S3: tbl_df/tbl/data.frame)
##  $ ii : int [1:6] 1 2 3 4 5 6
##  $ dd : num [1:6] 1 2 3 4 5 6
##  $ cc : chr [1:6] "a" "b" "c" "d" ...
##  $ ll : logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
##  $ ff : Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
##  $ fo : Ord.factor w/ 2 levels "even"<"odd": 2 1 2 1 2 1
##  $ dtm: POSIXct[1:6], format: "2022-02-08 23:47:05" "2022-02-08 23:47:06" "2022-02-08 23:47:07" ...
##  $ dat: Date[1:6], format: "2022-02-09" "2022-02-10" "2022-02-11" ...
##  $ new: Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 2
##  $ xcc: chr [1:6] "FALSE" "TRUE" "FALSE" "TRUE" ...
##  $ xll: logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE

Flights

bb <- xxflights
aa <- c("month", "day")
str(bb[aa])
## tibble [336,776 x 2] (S3: tbl_df/tbl/data.frame)
##  $ month: num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day  : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
# #To factor
bb$day <- as.factor(bb$day)
bb$month <- as.factor(bb$month)
# #Equivalent
#bb[aa] <- lapply(bb[aa], as.factor)
str(bb[aa])
## tibble [336,776 x 2] (S3: tbl_df/tbl/data.frame)
##  $ month: Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ day  : Factor w/ 31 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...

Re-Label

# #Unordered Named Vector of Fruits with Names of Colours
# #NOTE: First letters of each colour and fruit match
ii <- c("pink" = "papaya", "black" = "banana", "orchid" = "orange", "amber" = "apple")
ii
##     pink    black   orchid    amber 
## "papaya" "banana" "orange"  "apple"
#
# #Factor Vectors (default is Alphabetical Sorting) using unname() to remove names
fruit_base <- factor(unname(ii))
# #sort()
fruit_sort <- factor(unname(sort(ii)))
# #unique() provides the values in the sequence of their appearance 
fruit_uniq <- factor(unname(ii), levels = unique(ii))
#
# #By Default Levels Match even though the actual Vectors do not Match
stopifnot(identical(levels(fruit_base), levels(fruit_sort)))
fruit_base
## [1] papaya banana orange apple 
## Levels: apple banana orange papaya
fruit_sort
## [1] apple  banana orange papaya
## Levels: apple banana orange papaya
fruit_uniq
## [1] papaya banana orange apple 
## Levels: papaya banana orange apple
#
# #Relabelling: First letters should always match between Fruits and Colours
color_base <- fruit_base
color_sort <- fruit_sort
color_uniq <- fruit_uniq
#
levels(color_base) <- names(ii)[match(color_base, ii)] #WRONG
levels(color_sort) <- names(ii)[match(color_sort, ii)]
levels(color_uniq) <- names(ii)[match(color_uniq, ii)]
#
# #CAUTION: This is WRONG. 
color_base #WRONG
## [1] amber  black  orchid pink  
## Levels: pink black orchid amber
#
color_sort 
## [1] amber  black  orchid pink  
## Levels: amber black orchid pink
color_uniq 
## [1] pink   black  orchid amber 
## Levels: pink black orchid amber
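A safer relabelling recipe (a sketch, reusing the ‘ii’ lookup from above) maps each LEVEL through the lookup instead of each data element, so it is correct for any level ordering:

```r
# #Named lookup: names are colours, values are fruits (first letters match)
ii <- c("pink" = "papaya", "black" = "banana", "orchid" = "orange", "amber" = "apple")
fruit_base <- factor(unname(ii))     # default alphabetical levels
#
# #Relabel via levels(): each LEVEL is looked up exactly once
color_ok <- fruit_base
levels(color_ok) <- names(ii)[match(levels(color_ok), ii)]
color_ok
## [1] pink   black  orchid amber 
## Levels: amber black orchid pink
#
# #Validation: first letters match elementwise
stopifnot(identical(substr(as.character(color_ok), 1, 1),
                    substr(as.character(fruit_base), 1, 1)))
```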

2.10 Lists

Definition 2.4 Lists are by far the most flexible data structure in R. They can be seen as a collection of elements without any restriction on the class, length or structure of each element.

Caution: The only thing you need to take care of is that you do not give two elements the same name; R will NOT throw an ERROR if you do.

Definition 2.5 Data Frames are lists with restriction that all elements of a data frame are of equal length.

Due to the resulting two-dimensional structure, data frames can mimic some of the behaviour of matrices. You can select rows and do operations on rows. You cannot do that with lists, as a row is undefined there.

A Dataframe is intended to be used as a relational table. This means that elements in the same column are related to each other in the sense that they are all measures of the same metric. And, elements in the same row are related to each other in the sense that they are all measures from the same observation or measures of the same item. This is why when you look at the structure of a Dataframe, it will state the number of observations and the number of variables instead of the number of rows and columns.

Dataframes are distinct from Matrices because they can include heterogeneous data types among columns/variables. Dataframes do not permit multiple data types within a column/variable, for reasons that also follow from the relational table idea.

All this implies that you should use a data frame for any dataset that fits in that two-dimensional structure. Essentially, you use data frames for any dataset where a column coincides with a variable and a row coincides with a single observation in the broad sense of the word. For all other structures, lists are the way to go.

  • Does everything in R have (exactly one) class
    • Everything has (at least one) class. Objects can have multiple classes
    • It is mostly just the class attribute of an object. But when the class attribute is not set, the class() function makes up a class from the object ‘type’ and the ‘dim’ attribute.
    • lists and dataframes have same typeof ‘list’ but different class
  • Then what does typeof() tell us
    • It tells us the internal ‘storage mode’ of an object, i.e. how R stores the object and interacts with it.
    • An object has one and only one mode (SO) Difference between mode and class
    • class is an attribute and thus can be defined/overridden by a user, however, mode (i.e. typeof ) cannot be
  • To define an object, what should be known about it
    • class(), typeof(), is(), attributes(), str(), inherits(), …

list

# #CAUTION: Do not Create a list with duplicate names (R will NOT throw ERROR)
bb <- list(a=1, b=2, a=3)
# # 3rd index cannot be accessed using $
bb$a
## [1] 1
identical(bb$a, bb[[1]])
## [1] TRUE
identical(bb$a, bb[[3]])
## [1] FALSE
bb[[3]]
## [1] 3

class vs. typeof

# #Create a list
bb_lst <- list( a = c(1, 2), b = c('a', 'b', 'c'))
tryCatch(
# #Trying to create varying length of variables in dataframe like in list
  bb_dft <- data.frame(a = c(1, 2), b = c('a', 'b', 'c')), 
  error = function(e) {
# #Print ERROR
    cat(paste0(e))
# #Double Arrow Assignment '<<-' to assign in parent environment
    bb_dft <<- data.frame(a = c(1, 2), b = c('a', 'b'))
    }
  )
## Error in data.frame(a = c(1, 2), b = c("a", "b", "c")): arguments imply differing number of rows: 2, 3
#
# #Both list and dataframe have same type() 
typeof(bb_lst)
## [1] "list"
typeof(bb_dft)
## [1] "list"
#
# #But, class() is different for list and dataframe
class(bb_lst)
## [1] "list"
class(bb_dft)
## [1] "data.frame"
#
str(bb_lst)
## List of 2
##  $ a: num [1:2] 1 2
##  $ b: chr [1:3] "a" "b" "c"
str(bb_dft)
## 'data.frame':    2 obs. of  2 variables:
##  $ a: num  1 2
##  $ b: chr  "a" "b"
#
# #Although 'bb_lst_c' is a list, coercion takes place inside c() i.e. '9' becomes character
bb_lst_c <- list( a = c(8, 'x'), b = c('y', 9))
str(bb_lst_c[[2]][2])
##  chr "9"
#
# #Here, '9' is numeric, it is stored as list element so note the extra [[]]
bb_lst_l <- list( a = list(8, 'x'), b = list('y', 9))
str(bb_lst_l[[2]][[2]])
##  num 9

2.11 Matrix

# #Create a Matrix
bb_mat <- matrix(1:6, nrow = 2, ncol = 3)
print(bb_mat)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
str(bb_mat)
##  int [1:2, 1:3] 1 2 3 4 5 6
class(bb_mat)
## [1] "matrix" "array"
typeof(bb_mat)
## [1] "integer"

2.12 Merge

# #Basic Tibble
bb <- xxbasic10
str(bb)
## tibble [6 x 8] (S3: tbl_df/tbl/data.frame)
##  $ ii : int [1:6] 1 2 3 4 5 6
##  $ dd : num [1:6] 1 2 3 4 5 6
##  $ cc : chr [1:6] "a" "b" "c" "d" ...
##  $ ll : logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
##  $ ff : Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
##  $ fo : Ord.factor w/ 2 levels "even"<"odd": 2 1 2 1 2 1
##  $ dtm: POSIXct[1:6], format: "2022-02-08 23:47:05" "2022-02-08 23:47:06" "2022-02-08 23:47:07" ...
##  $ dat: Date[1:6], format: "2022-02-09" "2022-02-10" "2022-02-11" ...
# #Split with 'cc' as common ID column
bb_a <- bb[1:3]
bb_b <- bb[3:ncol(bb)]
#
# #merge() using the common ID column 'cc'
bb_new <- merge(bb_a, bb_b, by = "cc")
bb_new
##   cc ii dd    ll   ff   fo                 dtm        dat
## 1  a  1  1 FALSE  odd  odd 2022-02-08 23:47:05 2022-02-09
## 2  b  2  2  TRUE even even 2022-02-08 23:47:06 2022-02-10
## 3  c  3  3 FALSE  odd  odd 2022-02-08 23:47:07 2022-02-11
## 4  d  4  4  TRUE even even 2022-02-08 23:47:08 2022-02-12
## 5  e  5  5 FALSE  odd  odd 2022-02-08 23:47:09 2022-02-13
## 6  f  6  6  TRUE even even 2022-02-08 23:47:10 2022-02-14

2.13 Sort

  • sort()
    • It sorts a vector in ascending order (by default)
  • rank()
    • rank() returns the rank each element would have in an ascending list
    • The smallest number receives the rank 1
    • If there are ties, it returns numeric not integer, with ranks such as 2.5
  • order()
    • order() returns the indices of the elements in ascending order of their values i.e. x[order(x)] is sorted
  • dplyr::arrange()
    • arrange() orders the rows of a data frame by the values of selected columns.
    • NA values are always sorted to the end, even when wrapped with desc().
ERROR 2.2 Error in arrange(bb, day) : could not find function "arrange"
  • Load the Package (dplyr etc.) having the function.
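A small vector with a tie contrasts the three base functions ('v' is illustrative):

```r
# #sort(), rank(), order() on a vector with a tie
v <- c(30, 10, 20, 10)
sort(v)    # the values in ascending order
## [1] 10 10 20 30
rank(v)    # rank of each element; ties are averaged, so the result is numeric
## [1] 4.0 1.5 3.0 1.5
order(v)   # indices such that v[order(v)] is sorted
## [1] 2 4 3 1
```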

order

bb <- xxflights
# #Sort ascending (default)
bb_1 <- bb[order(bb$dep_delay), ]
# #Sort descending
bb_2 <- bb[order(-bb$dep_delay), ]
#
bb[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
##   dep_time dep_delay tailnum carrier
##      <dbl>     <dbl> <chr>   <chr>  
## 1      517         2 N14228  UA     
## 2      533         4 N24211  UA     
## 3      542         2 N619AA  AA     
## 4      544        -1 N804JB  B6     
## 5      554        -6 N668DN  DL
bb_1[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
##   dep_time dep_delay tailnum carrier
##      <dbl>     <dbl> <chr>   <chr>  
## 1     2040       -43 N592JB  B6     
## 2     2022       -33 N612DL  DL     
## 3     1408       -32 N825AS  EV     
## 4     1900       -30 N934DL  DL     
## 5     1703       -27 N208FR  F9
bb_2[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
##   dep_time dep_delay tailnum carrier
##      <dbl>     <dbl> <chr>   <chr>  
## 1      641      1301 N384HA  HA     
## 2     1432      1137 N504MQ  MQ     
## 3     1121      1126 N517MQ  MQ     
## 4     1139      1014 N338AA  AA     
## 5      845      1005 N665MQ  MQ

Multi Column

bb <- xxbasic10
bb
## # A tibble: 6 x 8
##      ii    dd cc    ll    ff    fo    dtm                 dat       
##   <int> <dbl> <chr> <lgl> <fct> <ord> <dttm>              <date>    
## 1     1     1 a     FALSE odd   odd   2022-02-08 23:47:05 2022-02-09
## 2     2     2 b     TRUE  even  even  2022-02-08 23:47:06 2022-02-10
## 3     3     3 c     FALSE odd   odd   2022-02-08 23:47:07 2022-02-11
## 4     4     4 d     TRUE  even  even  2022-02-08 23:47:08 2022-02-12
## 5     5     5 e     FALSE odd   odd   2022-02-08 23:47:09 2022-02-13
## 6     6     6 f     TRUE  even  even  2022-02-08 23:47:10 2022-02-14
# #Sort ascending (default)
(bb_1 <- bb[order(bb$ll), ])
## # A tibble: 6 x 8
##      ii    dd cc    ll    ff    fo    dtm                 dat       
##   <int> <dbl> <chr> <lgl> <fct> <ord> <dttm>              <date>    
## 1     1     1 a     FALSE odd   odd   2022-02-08 23:47:05 2022-02-09
## 2     3     3 c     FALSE odd   odd   2022-02-08 23:47:07 2022-02-11
## 3     5     5 e     FALSE odd   odd   2022-02-08 23:47:09 2022-02-13
## 4     2     2 b     TRUE  even  even  2022-02-08 23:47:06 2022-02-10
## 5     4     4 d     TRUE  even  even  2022-02-08 23:47:08 2022-02-12
## 6     6     6 f     TRUE  even  even  2022-02-08 23:47:10 2022-02-14
# #Sort on Multiple Columns with ascending and descending
(bb_2 <- bb[order(bb$ll, -bb$dd), ])
## # A tibble: 6 x 8
##      ii    dd cc    ll    ff    fo    dtm                 dat       
##   <int> <dbl> <chr> <lgl> <fct> <ord> <dttm>              <date>    
## 1     5     5 e     FALSE odd   odd   2022-02-08 23:47:09 2022-02-13
## 2     3     3 c     FALSE odd   odd   2022-02-08 23:47:07 2022-02-11
## 3     1     1 a     FALSE odd   odd   2022-02-08 23:47:05 2022-02-09
## 4     6     6 f     TRUE  even  even  2022-02-08 23:47:10 2022-02-14
## 5     4     4 d     TRUE  even  even  2022-02-08 23:47:08 2022-02-12
## 6     2     2 b     TRUE  even  even  2022-02-08 23:47:06 2022-02-10
#
stopifnot(identical(bb_2, arrange(bb, ll, -dd)))

Validation


3 Data Manipulation (B11, Sep-12)

3.1 Overview

3.2 Get Help

# #To get the Help files on any Topic including 'loaded' Packages
?dplyr
?mutate
# #Help files on any Topic including functions from 'installed' but 'not loaded' Packages
?dplyr::mutate()
# #Operators need Backticks i.e. ` . In keyboards it is located below 'Esc' Key
?`:`
# #To Get the list of All Options used by Base R (including user defined)
?options

3.3 Logical Operators and Functions

  • "|"      (Or, binary, vectorized)
  • "||"     (Or, binary, not vectorized)
  • "&"     (And, binary, vectorized)
  • "&&" (And, binary, not vectorized)
  • Functions - any(), all()

Overview

  • Vectorised forms are "&" and "|"
    • These compare vectors elementwise and operate over the complete vector length.
    • NA is a valid logical value. Where a component of x or y is NA, the result will be NA if the outcome is ambiguous.
    • All components of x and y are evaluated
    • (recycling) Elements are recycled if the vector lengths differ
    • These are NOT recommended for use inside if() clauses
    • These are generally used for filtering
    • In R, "&" and "|" operate pairwise on logical vectors (whereas in Python, C etc. these symbols are bitwise operators)
  • Non-vectorised forms are "&&" and "||"
    • These examine only the first element of each vector
      • Caution: For these, vector length should always be 1 (from R 4.3.0, longer operands are an error)
      • Use all() and any() to reduce the length to one
    • (short-circuit) These stop execution as soon as one operand is decisive i.e. TRUE for ||, FALSE for &&.
      • They will not evaluate the second operand if the first operand is enough to determine the value of the expression.
    • These are preferred in if() clauses
    • In R, "&&" and "||" operate on single values (like "&&" and "||" in C, or 'and'/'or' in Python); they are NOT bitwise
  • all() and any()
    • all() : Are All Values TRUE
      • TRUE for 0-length vector
    • any() : Is at least one of the values TRUE
      • FALSE for 0-length vector
    • The value is a logical vector of length one being TRUE, FALSE, or NA.

Operators

# #At least one TRUE is present
NA | TRUE
## [1] TRUE
# #Depending upon what the unknown is, the outcome will change
NA | FALSE
## [1] NA
# #Depending upon what the unknown is, the outcome will change
NA & TRUE
## [1] NA
# #At least one FALSE is present
NA & FALSE 
## [1] FALSE
#
# #For length 1 vectors, output of vectorised and non-vectorised forms is same
stopifnot(all(identical(NA || TRUE, NA | TRUE), identical(NA || FALSE, NA | FALSE),
              identical(NA && TRUE, NA & TRUE), identical(NA && FALSE, NA & FALSE)))
#
# #But for vectors of >1 length, output is different
x <- 1:5
y <- 5:1
(x > 2) & (y < 3)
## [1] FALSE FALSE FALSE  TRUE  TRUE
(x > 2) && (y < 3)
## [1] FALSE
#
# # '&&' evaluates only the first element of a vector, thus caution is advised
TRUE & c(TRUE, FALSE)
## [1]  TRUE FALSE
TRUE & c(FALSE, FALSE)
## [1] FALSE FALSE
TRUE && c(TRUE, FALSE)
## [1] TRUE
TRUE && c(FALSE, FALSE)
## [1] FALSE
TRUE && all(c(TRUE, FALSE))
## [1] FALSE
TRUE && any(c(TRUE, FALSE))
## [1] TRUE

Evaluation

if(exists("x")) rm(x)
exists("x")
## [1] FALSE
#
# # No short-circuit for "|" or "&", Evaluates Right and throws Error
tryCatch( TRUE | x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found
tryCatch( FALSE & x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found
#
# #Does not evaluate Right input because outcome already determined
tryCatch( TRUE || x, error = function(e) cat(paste0(e)))
## [1] TRUE
tryCatch( FALSE && x, error = function(e) cat(paste0(e)))
## [1] FALSE
# #evaluates Right input because outcome cannot be determined and throws error
tryCatch( TRUE && x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found

AnyAll

# #any()
any(NA, TRUE)
## [1] TRUE
any(NA, FALSE)
## [1] NA
any(NA, TRUE, na.rm = TRUE)
## [1] TRUE
any(NA, FALSE, na.rm = TRUE)
## [1] FALSE
any(character(0))
## [1] FALSE
#
# #all()
all(NA, TRUE)
## [1] NA
all(NA, FALSE)
## [1] FALSE
all(NA, TRUE, na.rm = TRUE)
## [1] TRUE
all(NA, FALSE, na.rm = TRUE)
## [1] FALSE
all(character(0))
## [1] TRUE

3.4 Relational Operators

\(>\) , \(<\) , \(==\) , \(>=\) , \(<=\) , \(!=\)

3.5 Filter

  • dplyr::filter()
  • subset() vs. filter() -
    • Caution: R Manual itself warns against usage of subset(). It is better to use [] for subsetting
    • Caution: NOT Verified Yet
      • subset works on matrices, however, filter does not
      • subset does not work on databases, filter does
      • subset does not drop the rownames, however, filter removes them
      • filter preserves the class of the column, subset does not
      • filter works with grouped data, subset ignores them
    • filter is stricter and thus leads to fewer unexpected outcomes
  • which()
    • Takes a Boolean vector and returns a shorter vector containing the indices of the elements which were true.
    • If you want to know ‘which’ elements of a logical vector are TRUE i.e. their indices.
      • Ex: Get the position of the maximum or minimum values
    • If NA are present and you do not want them in the output
  • with()
    • with() is a wrapper for functions with no ‘data’ argument. It allows usage of function as if it had a data argument.
ERROR 3.1 Error in match.arg(method) : object ’day’ not found
  • When the ‘dplyr’ package is not loaded, stats::filter() (a time-series function) is called instead and throws this error.
  • Either Load the Package (dplyr etc.) or use scope resolution ‘::’
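For locating extremes, which.max()/which.min() return the first index, while which() can return all matches (a sketch with an illustrative vector):

```r
# #which.max() ignores NA and returns the FIRST position of the maximum
v <- c(3, 7, NA, 7, 1)
which.max(v)
## [1] 2
# #which() drops NA and returns ALL matching positions
which(v == max(v, na.rm = TRUE))
## [1] 2 4
```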

Basics

bb <- xxflights
# #dplyr::filter() - Filter Rows based on Multiple Columns
bb_1 <- filter(bb, month == 1, day == 1)
dim(bb_1)
## [1] 842  19
# #Filtering by multiple criteria within a single logical expression
stopifnot(identical(bb_1, filter(bb, month == 1 & day == 1)))
#
if(anyNA(bb_1)) {
  bb_na <- na.omit(bb_1)
  print(paste0("Note: ", length(attributes(bb_na)$na.action), " rows removed."))
} else {
  print("NA not found")
}
## [1] "Note: 11 rows removed."
dim(bb_na)
## [1] 831  19

Conditional

dim(bb)
## [1] 336776     19
#
# #Flights in either of the months of November or December
dim(bb_2 <- filter(bb, month == 11 | month == 12))
## [1] 55403    19
#
# #Flights with arrival delay '<= 120' or departure delay '<= 120' 
# #It excludes flights where arrival & departure BOTH are delayed by >2 hours
# #If either delay is less than 2 hours, the flight is included
dim(bb_3 <- filter(bb, arr_delay <= 120 | dep_delay <= 120))
## [1] 320060     19
dim(bb_4 <- filter(bb, !(arr_delay > 120 & dep_delay > 120)))
## [1] 320060     19
dim(bb_5 <- filter(bb, (!arr_delay > 120 | !dep_delay > 120)))
## [1] 320060     19
#
# #Destination to IAH or HOU
dim(bb_6 <- filter(bb, dest == "IAH" | dest == "HOU"))
## [1] 9313   19
dim(bb_7 <- filter(bb, dest %in% c("IAH", "HOU")))
## [1] 9313   19
#
# #Carrier being "UA", "US", "DL"
dim(bb_8 <- filter(bb, carrier == "UA" | carrier == "US" | carrier == "DL"))
## [1] 127311     19
dim(bb_9 <- filter(bb, carrier %in% c("UA", "US", "DL")))
## [1] 127311     19
#
# #Did not leave late (before /on time departure) but Arrived late by >2 hours
dim(bb_10 <- filter(bb, (arr_delay > 120) & !(dep_delay > 0)))
## [1] 29 19
# 
# #Departed between midnight and 6 AM (inclusive)
dim(bb_11 <- filter(bb, (sched_dep_time >= 00 & sched_dep_time <= 600)))
## [1] 8970   19

subset()

# #subset() - Recommendation is against its usage. Use either '[]' or filter()
dim(bb_12 <- subset(bb, month == 1 | !(dep_delay >= 120), 
                    select = c("flight", "arr_delay")))
## [1] 319760      2
dim(bb_13 <- subset(bb, month == 1 | !(dep_delay >= 120) | carrier == "DL", 
                select = c("flight", "arr_delay")))
## [1] 321139      2

Filter Rows

# #Data: mtcars, 32x11, "mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb"
bb <- aa <- mtcars
#str(bb)
#summary(bb)
# #
# #Avoid subset()
ii <- subset(bb, wt > 2 & wt < 3)
# #which() 
jj <- bb[which(bb$wt > 2 & bb$wt <= 3), ]
#
# #which() select only TRUE and NOT the NA
(1:2)[which(c(TRUE, NA))]
## [1] 1
(1:2)[c(TRUE, NA)]
## [1]  1 NA
#
# #which() is faster than head() 
ee <- bb[which(bb$wt > 2 & bb$wt <= 3)[1:6], ] 
ff <- head(bb[bb$wt > 2 & bb$wt <= 3, ], 6)
stopifnot(identical(ee, ff))
#
# #Normal Filter using [] operator
kk <- bb[bb$wt > 2 & bb$wt <= 3, ]
#
# #with()
ll <- with(bb, bb[wt > 2 & wt <= 3, ])
#
# #filter()
mm <- bb %>% filter(wt > 2 & wt <= 3)
#
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll), identical(ii, mm)))
#
# #Another set of equivalent operations for OR 
ii <- subset(bb, cyl == 4 | cyl == 6)
jj <- bb[bb$cyl %in% c(4, 6), ]
kk <- bb[which(bb$cyl %in% c(4, 6)), ]
ll <- bb %>% filter(cyl == 4 | cyl == 6)
mm <- bb %>% filter(cyl %in% c(4, 6))
#
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll), identical(ii, mm)))
#
# #General Conditional Subsetting on Flights data
bb <- xxflights
dim(bb)
## [1] 336776     19
#
dim(bb[which(bb$day == 1 & !(bb$month ==1)), ])
## [1] 10194    19
dim(bb[which(bb$day == 1 | bb$month ==1), ])
## [1] 37198    19
dim(bb[which(bb$day == 1 & bb$month ==1), ])
## [1] 842  19
# #CAUTION: which() takes ONE logical vector; the second argument here is silently matched to 'arr.ind', so the month condition is ignored
dim(bb[which(bb$day == 1, bb$month ==1), ])
## [1] 11036    19
dim(bb[which(bb$day == 1 & !(bb$carrier == "DL")), ])
## [1] 9482   19
dim(bb[which(bb$day == 1 | bb$carrier == "DL"), ])
## [1] 57592    19
dim(bb[which(bb$day == 1 & bb$carrier == "DL"), ])
## [1] 1554   19
# #CAUTION: which() takes ONE logical vector; the second argument here is silently matched to 'arr.ind', so the carrier condition is ignored
dim(bb[which(bb$day == 1, bb$carrier == "DL"), ])
## [1] 11036    19

3.6 Subsetting

\([ \ \ ]\) , \([[ \ \ ]]\) , \(\$\)

  • Extract or Replace Parts of an Object
    • Operators acting on vectors, matrices, arrays and lists to extract or replace parts.
    • The most important distinction between “[ ],” “[[ ]]” and “$” is that the “[ ]” can select more than one element whereas the other two select a single element.
    • “$” does not allow computed indices, whereas “[[ ]]” does.
    • Subsetting (except by an empty index) will drop all attributes except names, dim and dimnames. Indexing will keep them.
ERROR 3.2 Error in day == 1 : comparison (1) is possible only for atomic and list types
  • It occurs when the data is not available i.e. column name is NOT found
  • It might happen when the original code assumed that the dataframe is attached
  • Either attach the dataframe (NOT Recommended) or use “$” to access column names
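A minimal sketch of the three operators on a list (names are illustrative):

```r
# #"[" keeps the container, "[[" and "$" extract a single element
lst <- list(a = 1:3, b = letters[1:3])
class(lst["a"])     # still a (one-element) list
## [1] "list"
class(lst[["a"]])   # the element itself
## [1] "integer"
# #"[[" allows computed indices, "$" does not
nm <- "a"
lst[[nm]]
## [1] 1 2 3
lst$nm              # looks for an element literally named 'nm'
## NULL
```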

dplyr::select()

  • It can use Range ":", Not "!", And "&", Or "|"
  • Selection Helpers
    • everything(): Matches all variables.
    • last_col(): Select last variable, possibly with an offset.
  • These helpers select variables by matching patterns in their names:
    • starts_with(): Starts with a prefix.
    • ends_with(): Ends with a suffix.
    • contains(): Contains a literal string.
    • matches(): Matches a regular expression.
    • num_range(): Matches a numerical range like x01, x02, x03.
  • These helpers select variables from a character vector:
    • all_of(): Matches variable names in a character vector. All names must be present, otherwise an out-of-bounds error is thrown.
    • any_of(): Same as all_of(), except that no error is thrown for names that do not exist.
  • This helper selects variables with a function:
    • where(): Applies a function to all variables and selects those for which the function returns TRUE.
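A sketch of a few helpers on the flights data (assumes dplyr is loaded; 'zzz' is a deliberately non-existent name):

```r
bb <- xxflights
# #Columns whose names start with a prefix
names(select(bb, starts_with("dep_")))
## [1] "dep_time"  "dep_delay"
# #Columns containing a literal string
names(select(bb, contains("delay")))
## [1] "dep_delay" "arr_delay"
# #any_of() silently skips names that do not exist
names(select(bb, any_of(c("carrier", "zzz"))))
## [1] "carrier"
```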

Cols

dim(bb)
## [1] 336776     19
#
# #Subset Consecutive Columns using Colon
stopifnot(identical(bb[ , 2:5], bb[ , -c(1, 6:ncol(bb))]))
#
# #dplyr::select()
bb_14 <- select(bb, year:day, arr_delay, dep_delay, distance, air_time)
bb_15 <- bb %>% select(year:day, arr_delay, dep_delay, distance, air_time)
stopifnot(identical(bb_14, bb_15))

3.7 Grouped Summary

  • dplyr::summarise() or dplyr::summarize()
  • dplyr::group_by()
    • It converts an existing Tibble into a grouped Tibble where operations are performed “by group.”
    • ungroup() removes grouping.
    • n() gives the number of observations in the current group.

Summarise

bb <- xxflights
# #dplyr::summarise() & dplyr::summarize() are same
# #Get the mean of a column with NA excluded
#
summarize(bb, delay_mean = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
##   delay_mean
##        <dbl>
## 1       12.6
#
# #base::summary()
summary(bb$dep_delay)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -43.00   -5.00   -2.00   12.64   11.00 1301.00    8255
#
# #Grouped Summary
by_ymd <- group_by(bb, year, month, day)
mysum <- summarize(by_ymd, 
                   dep_delay_mean = mean(dep_delay, na.rm = TRUE), 
                   arr_delay_mean = mean(arr_delay, na.rm = TRUE),
                   .groups = "keep")
# #Equivalent 
bb %>% 
  group_by(year, month, day) %>% 
  summarize(dep_delay_mean = mean(dep_delay, na.rm = TRUE), 
            arr_delay_mean = mean(arr_delay, na.rm = TRUE),
            .groups= "keep")
## # A tibble: 365 x 5
## # Groups:   year, month, day [365]
##     year month   day dep_delay_mean arr_delay_mean
##    <dbl> <dbl> <dbl>          <dbl>          <dbl>
##  1  2013     1     1          11.5          12.7  
##  2  2013     1     2          13.9          12.7  
##  3  2013     1     3          11.0           5.73 
##  4  2013     1     4           8.95         -1.93 
##  5  2013     1     5           5.73         -1.53 
##  6  2013     1     6           7.15          4.24 
##  7  2013     1     7           5.42         -4.95 
##  8  2013     1     8           2.55         -3.23 
##  9  2013     1     9           2.28         -0.264
## 10  2013     1    10           2.84         -5.90 
## # ... with 355 more rows

group_by()

# #Get delay grouped by distance 'Distance between airports, in miles.'
summary(bb$distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      17     502     872    1040    1389    4983
#
# #How many unique values are present in this numeric data i.e. factors
str(as.factor(bb$distance))
##  Factor w/ 214 levels "17","80","94",..: 163 165 145 171 106 96 138 22 120 99 ...
str(sort(unique(bb$distance)))
##  num [1:214] 17 80 94 96 116 143 160 169 173 184 ...
bb %>% 
  group_by(distance) %>% 
  summarize(count = n(),
            dep_delay_mean = mean(dep_delay, na.rm = TRUE), 
            arr_delay_mean = mean(arr_delay, na.rm = TRUE),
            .groups= "keep")
## # A tibble: 214 x 4
## # Groups:   distance [214]
##    distance count dep_delay_mean arr_delay_mean
##       <dbl> <int>          <dbl>          <dbl>
##  1       17     1         NaN           NaN    
##  2       80    49          18.9          16.5  
##  3       94   976          17.5          12.7  
##  4       96   607           3.19          5.78 
##  5      116   443          17.7           7.05 
##  6      143   439          23.6          14.4  
##  7      160   376          21.8          16.2  
##  8      169   545          18.5          15.1  
##  9      173   221           7.05         -0.286
## 10      184  5504           3.07          0.123
## # ... with 204 more rows
#
# #For distance =17, there is only 1 flight and that too has NA, so the mean is NaN
bb[bb$distance == 17, ]
## # A tibble: 1 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##   <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>          <dbl>     <dbl> <chr>  
## 1  2013     7    27       NA            106        NA       NA            245        NA US     
## # ... with 9 more variables: flight <dbl>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
#
# #In general, all flights to a given destination (ex: ABQ) have travelled the same distance (1826 miles)
unique(bb %>% filter(dest == "ABQ") %>% select(distance))
## # A tibble: 1 x 1
##   distance
##      <dbl>
## 1     1826
#
# #Mean Delays for Destinations with more than 1000 miles distance
bb %>% 
  group_by(dest) %>% 
  filter(distance > 1000) %>% 
  summarize(count = n(), 
            distance_mean = mean(distance, na.rm = TRUE),
            dep_delay_mean = mean(dep_delay, na.rm = TRUE), 
            arr_delay_mean = mean(arr_delay, na.rm = TRUE))
## # A tibble: 48 x 5
##    dest  count distance_mean dep_delay_mean arr_delay_mean
##    <chr> <int>         <dbl>          <dbl>          <dbl>
##  1 ABQ     254         1826           13.7           4.38 
##  2 ANC       8         3370           12.9          -2.5  
##  3 AUS    2439         1514.          13.0           6.02 
##  4 BQN     896         1579.          12.4           8.25 
##  5 BUR     371         2465           13.5           8.18 
##  6 BZN      36         1882           11.5           7.6  
##  7 DEN    7266         1615.          15.2           8.61 
##  8 DFW    8738         1383.           8.68          0.322
##  9 DSM     569         1021.          26.2          19.0  
## 10 EGE     213         1736.          15.5           6.30 
## # ... with 38 more rows

3.8 Mutate

  • dplyr::mutate()
    • Newly created variables are available immediately
    • New variables overwrite existing variables of the same name.
    • Variables can be removed by setting their value to NULL.
    • mutate() adds new variables and preserves existing ones
      • mutate() can also keep or drop column according to the .keep argument.
    • transmute() adds new variables and drops existing ones.
ERROR 3.3 Error in UseMethod("select") : no applicable method for ’select’ applied to an object of class "function"
  • Run ‘str(MyObject)’ to check that ‘MyObject’ exists, looks as expected and R is not finding something else.
  • Most probably the base function ‘data’ (in utils) was picked up in place of the actual data object.
  • To minimise this type of Error, do not use names that clash with Base R functions e.g. ‘data’ (function in utils) or ‘df’ (function in stats)
ERROR 3.4 Error: Problem with mutate() column ... column object ’arr_delay’ not found
  • Run ‘str(MyObject)’ to check if the column exists in the dataset
  • Caution: if the dataset was attached earlier, then R will NOT throw this error. However, later when the code is being executed in a clean environment, it will fail. To avoid this, it is recommended to use proper scope resolution and to avoid attaching the dataset (if possible)
dim(bb)
## [1] 336776     19
#
bb_16 <- select(bb, year:day, arr_delay, dep_delay, distance, air_time)
bb_17 <- mutate(bb_16,
       gain = arr_delay - dep_delay,
       speed = distance / air_time * 60,
       hours = air_time / 60,
       gain_per_hour = gain / hours)
# #Equivalent
bb %>% 
  select(year:day, arr_delay, dep_delay, distance, air_time) %>% 
  mutate(gain = arr_delay - dep_delay,
         speed = distance / air_time * 60,
         hours = air_time / 60,
         gain_per_hour = gain / hours)
## # A tibble: 336,776 x 11
##     year month   day arr_delay dep_delay distance air_time  gain speed hours gain_per_hour
##    <dbl> <dbl> <dbl>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl>         <dbl>
##  1  2013     1     1        11         2     1400      227     9  370. 3.78           2.38
##  2  2013     1     1        20         4     1416      227    16  374. 3.78           4.23
##  3  2013     1     1        33         2     1089      160    31  408. 2.67          11.6 
##  4  2013     1     1       -18        -1     1576      183   -17  517. 3.05          -5.57
##  5  2013     1     1       -25        -6      762      116   -19  394. 1.93          -9.83
##  6  2013     1     1        12        -4      719      150    16  288. 2.5            6.4 
##  7  2013     1     1        19        -5     1065      158    24  404. 2.63           9.11
##  8  2013     1     1       -14        -3      229       53   -11  259. 0.883        -12.5 
##  9  2013     1     1        -8        -3      944      140    -5  405. 2.33          -2.14
## 10  2013     1     1         8        -2      733      138    10  319. 2.3            4.35
## # ... with 336,766 more rows
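For contrast, a sketch of transmute() and mutate()'s .keep argument (.keep requires dplyr >= 1.0; the 'gain' column is illustrative):

```r
# #transmute() keeps ONLY the newly created variables
transmute(bb, gain = arr_delay - dep_delay)
# #mutate(.keep = "used") keeps the new variable plus only the columns used to compute it
mutate(bb, gain = arr_delay - dep_delay, .keep = "used")
```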

Validation


4 Statistics (B12, Sep-26)

4.2 Definitions

23.20 A population is the set of all elements of interest in a particular study.

23.23 The process of conducting a survey to collect data for the entire population is called a census.

23.21 A sample is a subset of the population.

29.7 A random sample of size \({n}\) from an infinite population is a sample selected such that the following two conditions are satisfied. Each element selected comes from the same population. Each element is selected independently. The second condition prevents selection bias.

4.3 Inferential Statistics

23.25 Statistics uses data from a sample to make estimates and test hypotheses about the characteristics of a population through a process referred to as statistical inference.

Inferential statistics are used for Hypothesis Testing. Refer Statistical Inference

4.4 Hypothesis Testing

Refer Hypothesis Testing

31.1 Hypothesis testing is a process in which, using data from a sample, an inference is made about a population parameter or a population probability distribution.

31.2 Null Hypothesis \((H_0)\) is a tentative assumption about a population parameter. It is assumed True, by default, in the hypothesis testing procedure.

31.3 Alternative Hypothesis \((H_a)\) is the complement of the Null Hypothesis. It is concluded to be True, if the Null Hypothesis is rejected.

Refer Steps of Hypothesis Testing

  1. State the NULL Hypothesis \({H_0}\)
    • The null will always be in the form of decisions regarding the population, not the sample.
      • If we have population data, we can do the census and then there is no requirement of any hypothesis or estimation.
    • The Null Hypothesis will always be written as the absence of some parameter or process characteristic
      • The test is designed to assess the strength of the evidence against the null hypothesis.
      • Often the null hypothesis is a statement of “no difference.”
    • Equality part of expression always appears in \({H_0}\) i.e. it can be \(>=\) , \(<=\) , \(==\)
    • The term ‘null’ is used because this hypothesis assumes that there is no difference between the two means or that the recorded difference is not significant.
  2. An Alternative Hypothesis \({H_a}\), is then stated which will be the complement of the Null Hypothesis.
    • \({H_a}\) cannot have equality part of expression i.e. it can be \(<\) , \(>\) , \(!=\)
    • The claim about the population that evidence is being sought for is the alternative hypothesis
      • However, to prove it true, one instead tries to prove its complement (the null hypothesis) false, because it is easier to disprove a statement.
  3. For Hypothesis tests involving a population mean, let \({\mu}_0\) denote the hypothesized value

31.4 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)

31.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

31.6 \(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)

  • Sample data is used to determine whether you can be statistically confident in rejecting, or must fail to reject, the \({H_0}\).
    • If the \({H_0}\) is rejected, the statistical conclusion is that the \({H_a}\) is TRUE.
  • Notes:
    • Sometimes it is easier to formulate the alternative hypothesis (the conclusion that you hope to support) and create NULL hypothesis based on that.
    • Ex: If we are testing for validity of the claim that number of defects are less than 2%
      • \({H_a} : {\mu} < 2\% \iff {H_0} : {\mu} \geq 2\%\)
      • If the \({H_0}\) is rejected, then the statistical conclusion is that the \({H_a}\) is TRUE i.e. defects are less than 2% in the population
      • If the \({H_0}\) is not rejected, then no conclusion can be formed about the \({H_a}\).

Question: Is there an ideal sample size

  • NO
  • (“ForLater”) However, there exists a relationship between (I guess) alpha, beta and sample size n. (I could not find the link on later search.)
  • (Paraphrasing and only memory based, so this can be wrong!) Basically, for a given analysis, if we want to keep both types of errors to a manageable level, we can calculate the minimum number of samples that would let us determine the outcome at a certain minimum confidence level etc.

4.5 Point Estimation

29.9 To estimate the value of a population parameter, we compute a corresponding characteristic of the sample, referred to as a sample statistic. This process is called point estimation.

29.10 A sample statistic is the point estimator of the corresponding population parameter. For example, the sample statistics \(\overline{x}, s, s^2, s_{xy}, r_{xy}\) are point estimators for the corresponding population parameters \({\mu}\) (mean), \({\sigma}\) (standard deviation), \(\sigma^2\) (variance), \(\sigma_{xy}\) (covariance), \(\rho_{xy}\) (correlation)

29.11 The numerical value obtained for the sample statistic is called the point estimate. Estimate is used for sample value only, for population value it would be parameter. Estimate is a value while Estimator is a function.

Example: \({\overline{x}}\) is an estimator (of the population parameter ‘mean’ \({\mu}\)). Its estimate is 3 and this calculation process is an estimation.

4.6 Standard Deviation

25.8 Given a data set \({X = \{{x}_1, {x}_2, \ldots, {x}_n\}}\), the mean \({\overline{x}}\) is the sum of all of the values \({{x}_1, {x}_2, \ldots, {x}_n}\) divided by the count \({n}\).

Refer Standard Deviation and equation (25.12)

25.15 The standard deviation (\(s, \sigma\)) is defined to be the positive square root of the variance. It is a measure of the amount of variation or dispersion of a set of values.

\[\begin{align} \sigma &= \sqrt{\frac{1}{N} \sum_{i=1}^N \left(x_i - \mu\right)^2} \\ {s} &= \sqrt{\frac{1}{n-1} \sum_{i=1}^n \left(x_i - \overline{x}\right)^2} \end{align}\]

A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.

4.7 Variance

Refer Variance and equation (25.11)

25.14 The variance \(({\sigma}^2)\) is based on the difference between the value of each observation \({x_i}\) and the mean (\({\mu}\) for a population, \({\overline{x}}\) for a sample). The average of the squared deviations is called the variance.

\[\begin{align} \sigma^2 &= \frac{1}{n} \sum _{i=1}^{n} \left(x_i - \mu \right)^2 \\ s^2 &= \frac{1}{n-1} \sum _{i=1}^{n} \left(x_i - \overline{x} \right)^2 \end{align}\]

Variability is most commonly measured with the Range, IQR, SD, and Variance.
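In R, var() and sd() implement the sample (n - 1) formulas; a quick check with an illustrative vector:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
var(x)                                   # sample variance
## [1] 4.571429
sum((x - mean(x))^2) / (length(x) - 1)   # same, by the formula
## [1] 4.571429
sd(x)                                    # square root of the sample variance
## [1] 2.13809
```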

4.8 Standard Error or Sampling Fluctuation

The sample we draw from the population is only one from a large number of potential samples.

  • If ten researchers were all studying the same population, each drawing their own sample, they may obtain different answers, i.e. each of the ten researchers may come up with a different mean
  • Thus, the statistic in question (the mean) varies from sample to sample. It has a distribution called a sampling distribution.
  • We can use this distribution to understand the uncertainty in our estimate of the population parameter.

Refer Standard Error

29.13 In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\). It is the indicator of ‘Sampling Fluctuation.’

29.14 A sampling error is the difference between a population parameter and a sample statistic.

Sampling fluctuation (Standard Error) refers to the extent to which a statistic (mean, median, mode, sd etc.) takes on different values with different samples i.e. it refers to how much the value of the statistic fluctuates from sample to sample.

29.12 The sampling distribution of \({\overline{x}}\) is the probability distribution of all possible values of the sample mean \({\overline{x}}\).

Standard Deviation of \({\overline{x}}\), \(\sigma_{\overline{x}}\) is given by equation (29.1) i.e. \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}}\)

  • Generally, the standard error is unknown.
  • The higher the standard error, the higher the variation from sample to sample, i.e. the lower the reliability.
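Equation (29.1) can be checked by simulation; a sketch with a made-up normal population (μ = 50, σ = 10, n = 25):

```r
# #Empirical check of SE = sigma / sqrt(n) (made-up population)
set.seed(1)
nn <- 25L
# #Draw the sample mean many times from N(mu = 50, sigma = 10)
xbars <- replicate(10000, mean(rnorm(nn, mean = 50, sd = 10)))
# #SD of the sample means should be close to 10 / sqrt(25) = 2
cat("Theoretical SE =", 10 / sqrt(nn), "\n")
cat("Simulated SE =", round(sd(xbars), 3), "\n")
```

The simulated value lands very close to the theoretical 2, illustrating that the sampling fluctuation shrinks as \(\sqrt{n}\) grows.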

4.9 Test Statistic

Refer Test Statistic

31.11 Test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely the observed data match the distribution expected under the null hypothesis of that statistical test. It helps determine whether a null hypothesis should be rejected.

31.14 If \({\sigma}\) is known, the standard normal random variable \({z}\) is used as test statistic to determine whether \({\overline{x}}\) deviates from the hypothesized value of \({\mu}\) enough to justify rejecting the null hypothesis. Refer equation (31.1) \(\to z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}}\)
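Equation (31.1) translates directly to R; the numbers here (μ0 = 100, σ = 15, n = 36, x̄ = 105) are hypothetical:

```r
# #z test statistic when sigma is known (hypothetical values)
mu0 <- 100                      # #hypothesized mean
sigma <- 15                     # #known population SD
n <- 36
xbar <- 105                     # #observed sample mean
z <- (xbar - mu0) / (sigma / sqrt(n))
cat("z =", z)
## z = 2
```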

4.10 Calculate SD & SE

Standard Error (SE) is the same as ‘the standard deviation of the sampling distribution.’ The ‘variance of the sampling distribution’ is the variance of the data divided by the sample size N.

Calculate Statistics

# #DataSet: Height of 5 people in 'cm'
hh <- c(170.5, 161, 160, 170, 150.5)
#
# #N by length()
print(hh_len <- length(hh))
## [1] 5
#
# #Mean by mean()
hh_mean <- mean(hh)
cat("Mean = ", hh_mean)
## Mean =  162.4
#
# #Variance by var()
hh_var <- round(var(hh), 3)
cat("Variance = ", hh_var)
## Variance =  68.175
#
# #Standard Deviation (SD) by sd()
hh_sd <- round(sd(hh), 3)
cat("Standard Deviation (SD) = ", hh_sd)
## Standard Deviation (SD) =  8.257
#
# #Standard Error (SE) 
hh_se_sd <- round(hh_sd / sqrt(hh_len), 3)
cat("Standard Error (SE) = ", hh_se_sd)
## Standard Error (SE) =  3.693

R Functions

# #DataSet: Height of 5 people in 'cm'
print(hh)
## [1] 170.5 161.0 160.0 170.0 150.5
#
# #N by length()
print(hh_len <- length(hh))
## [1] 5
#
# #sum by sum()
print(hh_sum <- sum(hh))
## [1] 812
#
# #Mean by mean()
hh_mean <- mean(hh)
hh_mean_cal <- hh_sum / hh_len
stopifnot(identical(hh_mean, hh_mean_cal))
cat("Mean = ", hh_mean)
## Mean =  162.4
#
# #Calculate the deviation from the mean by subtracting each value from the mean
print(hh_dev <- hh - hh_mean)
## [1]   8.1  -1.4  -2.4   7.6 -11.9
#
# #Square the deviation
print(hh_sqdev <- hh_dev^2)
## [1]  65.61   1.96   5.76  57.76 141.61
#
# #Get Sum of the squared deviations
print(hh_sqdev_sum <- sum(hh_sqdev))
## [1] 272.7
#
# #Divide it by the 'sample size (N) - 1' for the Variance or use var()
hh_var <- round(var(hh), 3)
hh_var_cal <- hh_sqdev_sum / (hh_len -1)
stopifnot(identical(hh_var, hh_var_cal))
cat("Variance = ", hh_var)
## Variance =  68.175
#
# #Variance of the sampling distribution
hh_var_sample <- hh_var / hh_len
cat("Variance of the Sampling Distribution = ", hh_var_sample)
## Variance of the Sampling Distribution =  13.635
#
# #Take square root of the Variance for the Standard Deviation (SD) or use sd()
hh_sd_cal <- round(sqrt(hh_var), 3)
hh_sd <- sd(hh)
stopifnot(identical(round(hh_sd, 3), hh_sd_cal))
cat("Standard Deviation (SD) = ", hh_sd)
## Standard Deviation (SD) =  8.256815
#
# #Standard Error (SE)
# #SE
# #Divide the SD by the square root of the sample size for the Standard Error (SE)
# #
hh_se_sd <- round(hh_sd / sqrt(hh_len), 3)
#
# #Calculate SE from Variance 
hh_se_var <- round(sqrt(hh_var_sample), 3)
stopifnot(identical(hh_se_sd, hh_se_var))
cat("Standard Error (SE) = ", hh_se_sd)
## Standard Error (SE) =  3.693

4.11 Histogram and Density

Using Dataset Flights: “air_time” (amount of time spent in the air, in minutes). Refer figure 4.1

Graphs

(B12P01 B12P02) Flights: Air Time (min) excluding NA (Histogram and Density)

Figure 4.1 (B12P01 B12P02) Flights: Air Time (min) excluding NA (Histogram and Density)

NA

# #Remove All NA
aa <- na.omit(xxflights$air_time)
attr(aa, "na.action") <- NULL
str(aa)
##  num [1:327346] 227 227 160 183 116 150 158 53 140 138 ...
summary(aa)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0    82.0   129.0   150.7   192.0   695.0

Stats

# #Overview of Data after removal of NA
bb <- aa
stopifnot(is.null(dim(bb)))
summary(bb)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0    82.0   129.0   150.7   192.0   695.0
# #min(), max(), range(), summary()
min_bb <- summary(bb)[1]
max_bb <- summary(bb)[6]
range_bb <- max_bb - min_bb
cat(paste0("Range = ", range_bb, " (", min_bb, ", ", max_bb, ")\n"))
## Range = 675 (20, 695)
# #IQR(), summary()
iqr_bb <- IQR(bb)
cat(paste0("IQR = ", iqr_bb, " (", summary(bb)[2], ", ", summary(bb)[5], ")\n"))
## IQR = 110 (82, 192)
# #median(), mean(), summary()[3], summary()[4] 
median_bb <- median(bb)
cat("Median =", median_bb, "\n")
## Median = 129
mu_mean_bb <- mean(bb)
cat("Mean \u03bc =", mu_mean_bb, "\n")
## Mean µ = 150.6865
#
sigma_sd_bb <- sd(bb)
cat("SD (sigma) \u03c3 =", sigma_sd_bb, "\n")
## SD (sigma) σ = 93.6883
#
variance_bb <- var(bb)
cat(sprintf('Variance (sigma)%s %s%s =', '\u00b2', '\u03c3', '\u00b2'), variance_bb, "\n")
## Variance (sigma)² σ² = 8777.498

Histogram

# #Histogram
bb <- na.omit(xxflights$air_time)
hh <- tibble(ee = bb)
# #Basics
median_hh <- round(median(hh[[1]]), 1)
mean_hh <- round(mean(hh[[1]]), 1)
sd_hh <- round(sd(hh[[1]]), 1)
len_hh <- nrow(hh)
#
B12P01 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') + 
  geom_vline(aes(xintercept = mean_hh), color = '#440154FF') +
  geom_text(data = tibble(x = mean_hh, y = -Inf, 
                          label = paste0("Mean= ", mean_hh)), 
            aes(x = x, y = y, label = label), 
            color = '#440154FF', hjust = -0.5, vjust = 1.3, angle = 90) +
  geom_vline(aes(xintercept = median_hh), color = '#3B528BFF') +
  geom_text(data = tibble(x = median_hh, y = -Inf, 
                          label = paste0("Median= ", median_hh)), 
            aes(x = x, y = y, label = label), 
            color = '#3B528BFF', hjust = -0.5, vjust = -0.7, angle = 90) +
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Frequency", 
       subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh, 
                         "; Median= ", median_hh, "; SD= ", sd_hh,
                         ")"), 
        caption = "B12P01", title = "Flights: Air Time")
}

Density

# #Density Curve
# #Get Quantiles and Ranges of mean +/- sigma 
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 3 * sd_hh})
sig3l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 3 * sd_hh})
sig2r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 2 * sd_hh}, {x < mean_hh + 3 * sd_hh})
sig2l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 2 * sd_hh}, {x > mean_hh - 3 * sd_hh})
sig1r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + sd_hh}, {x < mean_hh + 2 * sd_hh})
sig1l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - sd_hh}, {x > mean_hh - 2 * sd_hh})
sig0r_hh <- density_hh_tbl %>% filter(x > mean_hh, {x < mean_hh + 1 * sd_hh})
sig0l_hh <- density_hh_tbl %>% filter(x < mean_hh, {x > mean_hh - 1 * sd_hh})
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
#
# # Latex Labels 
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'), 
                TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'), 
                TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
B12P02 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_density(alpha = 0.2, colour = "#21908CFF") + 
  geom_area(data = sig3l_hh, aes(x = x, y = y), fill = '#440154FF') + 
  geom_area(data = sig3r_hh, aes(x = x, y = y), fill = '#440154FF') + 
  geom_area(data = sig2l_hh, aes(x = x, y = y), fill = '#3B528BFF') + 
  geom_area(data = sig2r_hh, aes(x = x, y = y), fill = '#3B528BFF') + 
  geom_area(data = sig1l_hh, aes(x = x, y = y), fill = '#21908CFF') + 
  geom_area(data = sig1r_hh, aes(x = x, y = y), fill = '#21908CFF') + 
  geom_area(data = sig0l_hh, aes(x = x, y = y), fill = '#5DC863FF') + 
  geom_area(data = sig0r_hh, aes(x = x, y = y), fill = '#5DC863FF') + 
  #scale_y_continuous(limits = c(0, 0.009), breaks = seq(0, 0.009, 0.003)) +
  scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh) + 
  ggplot2::annotate("segment", x = xpoints_hh[4] - 0.5 * sd_hh, xend = xpoints_hh[2], y = 0.007, 
                    yend = 0.007, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) + 
  ggplot2::annotate("segment", x = xpoints_hh[4] + 0.5 * sd_hh, xend = xpoints_hh[6], y = 0.007, 
                    yend = 0.007, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) + 
  ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.007, label = "95.4%") + 
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Density", 
       subtitle = paste0("(N=", nrow(.), "; ", "Mean= ", round(mean(.[[1]]), 1), 
                         "; Median= ", round(median(.[[1]]), 1), "; SD= ", round(sd(.[[1]]), 1),
                         ")"), 
        caption = "B12P02", title = "Flights: Air Time")
}

Aside

  • This section is NOT useful for the general reader and can be safely ignored. It contains my notes related to building this book. These are useful only for someone building their own book. (Shivam)
  • Side by Side Images need a Caption in Final Chunk
  • LaTeX inside TeX() cannot handle braces as usual; avoid them or escape them

4.12 Effect of Sample Size and Repeat Sampling

Using Dataset Flights: “air_time” (amount of time spent in the air, in minutes).

  1. Effect of increasing sample size (N = 100, 1000, 10000), Refer figure 4.2
    • the precision of and confidence in the estimate increase, and uncertainty decreases
    • the distribution of sample means becomes narrower, i.e. the standard error decreases
  2. Effect of increasing the Sampling, Refer figure 4.4
    • The mean of the distribution of sample means equals the mean of the parent distribution.
    • Refer Standard Error

Caution: The trend here does not match the theory. However, the exercise shows the ‘how to do it’ part. It can be repeated with better data, a larger sample size, or repeated sampling.

4.12.1 Sample Size

GIF

(B12P03 B12P04 B12P05) Effect of Increasing Sample Size

Figure 4.2 (B12P03 B12P04 B12P05) Effect of Increasing Sample Size

Images

(B12P03 B12P04 B12P05) Effect of Increasing Sample Size

Figure 4.3 (B12P03 B12P04 B12P05) Effect of Increasing Sample Size

Code

bb <- na.omit(xxflights$air_time)
# #Fix Seed
set.seed(3)
# #Set Sample Size
#nn <- 100L
# #Take a sample from dataset
xb100 <- sample(bb, size = 100L)
xb1000 <- sample(bb, size = 1000L)
xb10000 <- sample(bb, size = 10000L)
# #Population Mean
mu_hh <- round(mean(bb), 1)
# #Histogram: N = 100
hh <- tibble(ee = xb100)
ylim_hh <- 12.5
cap_hh <- "B12P03"
# #Assumes 'hh' has data in 'ee'. In: mu_hh, cap_hh, ylim_hh
#
B12 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') + 
  geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
  geom_text(aes(label = TeX(r'($\bar{x}$)', output = "character"), 
                x = mean(.data[["ee"]]), y = -Inf), 
            color = '#440154FF', hjust = 2, vjust = -2.5, parse = TRUE, check_overlap = TRUE) + 
  geom_vline(aes(xintercept = mu_hh), color = '#3B528BFF') +
  geom_text(aes(label = TeX(r'($\mu$)', output = "character"), x = mu_hh, y = -Inf),
            color = '#3B528BFF', hjust = -1, vjust = -2, parse = TRUE, check_overlap = TRUE) + 
  coord_cartesian(xlim = c(0, 800), ylim = c(0, ylim_hh)) + 
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Frequency", 
       subtitle = paste0("(Mean= ", round(mean(.[[1]]), 1), 
                         "; SD= ", round(sd(.[[1]]), 1),
                         #"; Var= ", round(var(.[[1]]), 1),
                         "; SE= ", round(sd(.[[1]]) / sqrt(nrow(.)), 1),
                         ")"), 
      caption = cap_hh, title = paste0("Sample Size = ", nrow(.)))
}
assign(cap_hh, B12)
rm(B12)

Warnings

  • “In mean.default(gg) : argument is not numeric or logical: returning NA”
    • For ggplot() - this comes up if the object ‘gg’ is NULL. Check whether ggplot is looking in the global scope instead of the local data frame that was passed.

Deprecated

4.12.2 Repeat Sampling

GIF

(B12P06 B12P07 B12P08) Effect of Increasing Sampling

Figure 4.4 (B12P06 B12P07 B12P08) Effect of Increasing Sampling

Images

(B12P06 B12P07 B12P08) Effect of Increasing Sampling

Figure 4.5 (B12P06 B12P07 B12P08) Effect of Increasing Sampling

Code

bb <- na.omit(xxflights$air_time)
# #Fix Seed
set.seed(3)
# #Set Sample Size
nn <- 10L
# #Set Repeat Sampling Rate
rr <- 20L
# #Take Sample of N = 10, get mean, repeat i.e. get distribution of mean
xr20 <- replicate(rr, mean(sample(bb, size = nn)))
rr <- 200L
xr200 <- replicate(rr, mean(sample(bb, size = nn)))
rr <- 2000L
xr2000 <- replicate(rr, mean(sample(bb, size = nn)))
#
# #Population Mean
mu_hh <- round(mean(bb), 1)
# #Histogram: N = 10, Repeat = 20
hh <- tibble(ee = xr20)
ylim_hh <- 2
cap_hh <- "B12P06"
# #Assumes 'hh' has data in 'ee'. In: mu_hh, cap_hh, ylim_hh, nn
#
B12 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') + 
  geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
  geom_text(aes(label = TeX(r'($E(\bar{x})$)', output = "character"), 
                x = mean(.data[["ee"]]), y = -Inf), 
            color = '#440154FF', hjust = 1.5, vjust = -1.5, parse = TRUE, check_overlap = TRUE) + 
  geom_vline(aes(xintercept = mu_hh), color = '#3B528BFF') +
  geom_text(aes(label = TeX(r'($\mu$)', output = "character"), x = mu_hh, y = -Inf),
            color = '#3B528BFF', hjust = -1, vjust = -2, parse = TRUE, check_overlap = TRUE) + 
  coord_cartesian(xlim = c(0, 800), ylim = c(0, ylim_hh)) + 
  theme(plot.title.position = "panel") + 
  labs(x = TeX(r'($\bar{x} \, (\neq x)$)'), y = TeX(r'(Frequency of $\, \bar{x}$)'), 
       subtitle = TeX(sprintf(
         "($\\mu$=%.0f) $E(\\bar{x}) \\, =$%.0f $\\sigma_{\\bar{x}} \\, =$%.0f",
                             mu_hh, round(mean(.[[1]]), 1), round(sd(.[[1]])))),
       caption = cap_hh, 
       title = paste0("Sampling Distribution (N = ", nn, ") & Repeat Sampling = ", nrow(.)))
}
assign(cap_hh, B12)
rm(B12)

4.13 Normal Distribution

(B12P09) Normal Distribution

Figure 4.6 (B12P09) Normal Distribution

Refer Normal Distribution and equation (28.2)

28.3 A normal distribution (\({\mathcal{N}}_{({\mu}, \, {\sigma}^2)}\)) is a type of continuous probability distribution for a real-valued random variable.

Its importance is partly due to the Central Limit Theorem. The assumption of a normal distribution allows the application of parametric methods.

40.1 Parametric methods are the statistical methods that begin with an assumption about the probability distribution of the population which is often that the population has a normal distribution. A sampling distribution for the test statistic can then be derived and used to make an inference about one or more parameters of the population such as the population mean \({\mu}\) or the population standard deviation \({\sigma}\).

29.15 Central Limit Theorem: In selecting random samples of size \({n}\) from a population, the sampling distribution of the sample mean \({\overline{x}}\) can be approximated by a normal distribution as the sample size becomes large.

It states that, under some conditions, the average of many samples (observations) of a random variable with finite mean and variance is itself a random variable—whose distribution converges to a normal distribution as the number of samples increases.

Parametric statistical tests typically assume that samples come from normally distributed populations, but the central limit theorem means that this assumption is not necessary to meet when you have a large enough sample. A sample size of 30 or more is generally considered large.

This is the basis of the Empirical Rule.

25.23 The empirical rule gives the percentage of data values that lie within one, two, and three standard deviations \({\sigma}\) of the mean \({\mu}\) for a normal distribution. The corresponding probabilities are 68.27%, 95.45%, and 99.73%.
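The three percentages follow from the standard normal CDF and can be reproduced with pnorm():

```r
# #Empirical rule probabilities from the standard normal CDF
for (k in 1:3) {
  p <- pnorm(k) - pnorm(-k)      # #P(mu - k*sigma <= x <= mu + k*sigma)
  cat("Within", k, "SD:", round(100 * p, 2), "%\n")
}
## Within 1 SD: 68.27 %
## Within 2 SD: 95.45 %
## Within 3 SD: 99.73 %
```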

Caution: If data from small samples do not closely follow this pattern, then other distributions like the t-distribution may be more appropriate.

4.14 Standard Normal Distribution

Refer Standard Normal and equation (28.3)

28.4 A random variable that has a normal distribution with a mean of zero \(({\mu} = 0)\) and a standard deviation of one \(({\sigma} = 1)\) is said to have a standard normal probability distribution. The z-distribution is given by \({\mathcal{z}}_{({\mu} = 0, \, {\sigma} = 1)}\)

The simplest case of a normal distribution is known as the standard normal distribution. Consider a population with normal distribution \({\mathcal{N}}_{({\mu}, \, {\sigma}^2)}\).

If \(\overline{X}\) is the mean of a sample of size \({n}\) from this population, then the standard error is \(\sigma/{\sqrt{n}}\) and thus the z-score is \(Z=\frac {\overline{X} - \mu}{\sigma/{\sqrt{n}}}\)

The z-score is the test statistic used in a z-test. The z-test is used to compare the means of two groups, or to compare the mean of a group to a set value. Its null hypothesis typically assumes no difference between groups.

For an upper-tail test, the area under the curve to the right of the z-score is the p-value; it is the probability of observing a result at least this extreme if the null hypothesis is true.

Usually, a p-value of 0.05 or less means that your results are unlikely to have arisen by chance; it indicates a statistically significant effect.
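Putting the pieces above together, a one-sample z-test can be done by hand in R (all numbers here are hypothetical):

```r
# #One-sample z-test by hand (hypothetical values)
mu0 <- 50; sigma <- 12; n <- 36; xbar <- 54
z <- (xbar - mu0) / (sigma / sqrt(n))         # #z-score of the sample mean
p <- pnorm(z, lower.tail = FALSE)             # #area to the right of z
cat("z =", z, "; p-value =", round(p, 4))
## z = 2 ; p-value = 0.0228
```

Since p < 0.05, the null hypothesis would be rejected at the 5% level for these hypothetical numbers.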

4.15 Outliers

Refer Outliers: C03

25.24 Outliers are data points or observations that do not fit the trend shown by the remaining data. They differ significantly from other observations. Unusually large or small values are commonly found to be outliers.

  • Question: If we include a datapoint which is 4 standard deviations away, would we be able to get the Normal Distribution
    • Shape of the curve will be tilted, thus it will be difficult to keep the datapoint and satisfy the condition for normality
    • Generally, only \({{\mu} - 3{\sigma} \leq {x} \leq {\mu} + 3{\sigma}}\) values are kept and the remaining are treated as outliers
  • Question: Is it bad data if it is 4 standard deviations away
    • It means that if we keep the data point, there is a high possibility that we will violate the normality assumption. If we violate the assumption, parametric methods cannot be applied to the dataset
    • In general, convert to z-value, remove those which have z-value higher than +3 or lower than -3
  • Question: But, how many removals are too many removals
    • There are techniques for this consideration, will be covered later. “ForLater”
    • (Aside) In a sample of 1000 observations, the presence of up to five observations deviating from the mean by more than three times the standard deviation is within the range of what can be expected. If the sample size is only 100, however, just three such outliers are already reason for concern.
  • Concern: Frequency or Proportion of outliers should not be very high
    • It is true that we cannot have normal distribution in that case.
        • However, can we afford to remove all the data points with \(z > 3\)? This needs to be answered in the context of the analysis.
        • Here the ‘assignable cause’ is applied. i.e. each datapoint that is proposed to be an outlier is individually analysed and either kept or removed
  • Concern: Sometimes the outliers are present because the dataset is a mixture of two distributions
    • In that case, those should be treated separately
  • Question: Are there tools for all of this jugglery
    • Yes, there are; in particular, nonparametric methods do not make any assumption about the distribution.
    • However, these are not as powerful as parametric tests, so if possible, stay with parametric tests
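The z-value screening described above can be sketched as follows (made-up data with one planted extreme value; the threshold of 3 follows the text):

```r
# #Flag outliers by |z| > 3 (made-up data with one planted extreme value)
set.seed(7)
xx <- c(rnorm(99, mean = 100, sd = 10), 200)   # #200 is far from the bulk
zz <- (xx - mean(xx)) / sd(xx)                 # #convert to z-values
outliers <- xx[abs(zz) > 3]
cat("Flagged as outliers:", outliers)
```

Note that the outlier itself inflates the sample mean and SD, so with very small samples an extreme point can ‘mask’ itself; this is one reason each proposed outlier should be examined individually, as the text suggests.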

4.16 Type I and Type II Errors

(C09P01) Type-I $(\alpha)$ and Type-II $(\beta)$ Errors

Figure 4.7 (C09P01) Type-I \((\alpha)\) and Type-II \((\beta)\) Errors

Example

  • Type-I “An innocent person is convicted”
  • Type-II “A guilty person is not convicted”

Since we are using sample data to make inferences about the population, it is possible that we will make an error. In the case of the Null Hypothesis, we can make one of two errors.

Refer Type I and Type II Errors

31.7 The error of rejecting \({H_0}\) when it is true, is Type I error \(({\alpha})\).

31.8 The error of accepting \({H_0}\) when it is false, is Type II error \(({\beta})\).

31.9 The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.

30.3 The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.

31.28 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).

There is always a tradeoff between Type-I and Type-II errors.

  • Generally max 5% \({\alpha}\) and max 20% \({\beta}\) errors are recommended
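For an upper-tail z-test the two error rates can be computed directly; a sketch with hypothetical numbers (testing \({H_0}: {\mu} \leq 50\) with σ = 10, n = 25, α = 0.05, when the true μ = 55):

```r
# #Type II error (beta) and power for an upper-tail z-test (hypothetical values)
mu0 <- 50; mu_true <- 55; sigma <- 10; n <- 25; alpha <- 0.05
se <- sigma / sqrt(n)                          # #standard error = 2
crit <- mu0 + qnorm(1 - alpha) * se            # #reject H0 if xbar > crit
beta <- pnorm(crit, mean = mu_true, sd = se)   # #P(fail to reject | mu = 55)
cat("beta =", round(beta, 3), "; power =", round(1 - beta, 3))
## beta = 0.196 ; power = 0.804
```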

In practice, the person responsible for the hypothesis test specifies the level of significance. By selecting \({\alpha}\), that person is controlling the probability of making a Type I error.

  • If the cost of making a Type I error is high, small values of \({\alpha}\) are preferred. Ex: \(\alpha =0.01\)
  • If the cost of making a Type I error is not too high, larger values of \({\alpha}\) are typically used. Ex: \(\alpha = 0.05\)

31.10 Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.

Although most applications of hypothesis testing control for the probability of making a Type I error, they do not always control for the probability of making a Type II error. Because of the uncertainty associated with making a Type II error when conducting significance tests, statisticians usually recommend that we use the statement "do not reject \({H_0}\)" instead of “accept \({H_0}\).”

4.17 Critical Value

(B12P11 B12P12) Left Tail vs. Right Tail

Figure 4.8 (B12P11 B12P12) Left Tail vs. Right Tail

(B12P13) Two Tail

Figure 4.9 (B12P13) Two Tail

31.18 Critical value is the value that is compared with the test statistic to determine whether \({H_0}\) should be rejected. Significance level \({\alpha}\), or confidence level (\(1 - {\alpha}\)), dictates the critical value (\(Z\)), or critical limit. Ex: For Upper Tail Test, \(Z_{{\alpha} = 0.05} = 1.645\).

# #Critical Value (z) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
#
# #Left Tail Test
round(qnorm(p = xxalpha, lower.tail = TRUE), 4)
##     10%      5%    5/2%      1%    1/2% 
## -1.2816 -1.6449 -1.9600 -2.3263 -2.5758
#
# #Right Tail Test
round(qnorm(p = xxalpha, lower.tail = FALSE), 4)
##    10%     5%   5/2%     1%   1/2% 
## 1.2816 1.6449 1.9600 2.3263 2.5758

31.16 A p-value is a probability that provides a measure of the evidence against the null hypothesis provided by the sample. The p-value is used to determine whether the null hypothesis should be rejected. Smaller p-values indicate more evidence against \({H_0}\).

31.19 An acceptance region (confidence interval) is a set of values for the test statistic for which the null hypothesis is accepted, i.e. if the observed test statistic is in the confidence interval then we accept the null hypothesis and reject the alternative hypothesis.

31.21 A rejection region (critical region) is a set of values for the test statistic for which the null hypothesis is rejected, i.e. if the observed test statistic is in the critical region then we reject the null hypothesis and accept the alternative hypothesis.

4.18 Tailed Tests

31.12 A one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic.

One-tailed tests are concerned with one side of the distribution of a statistic, whereas two-tailed tests deal with both tails of the distribution.

A two-tailed test is done when the direction of the effect is not known in advance, so you test for both sides.

31.4 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)

31.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

31.6 \(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)
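The three forms differ only in which tail(s) contribute to the p-value; a sketch for a hypothetical observed z = 1.8:

```r
# #p-values for the three tailed tests (hypothetical observed z)
z <- 1.8
p_left <- pnorm(z)                                 # #left or lower tail test
p_right <- pnorm(z, lower.tail = FALSE)            # #right or upper tail test
p_two <- 2 * pnorm(abs(z), lower.tail = FALSE)     # #two tail test
round(c(Left = p_left, Right = p_right, Two = p_two), 4)
##   Left  Right    Two 
## 0.9641 0.0359 0.0719
```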

4.19 Approaches

31.15 The p-value approach uses the value of the test statistic \({z}\) to compute a probability called a p-value.

Steps for the p-value approach or test statistic approach

  • Calculate \(z\) for given \(\overline{x}\): \(z = \frac{\overline{x} - \mu_0}{\sigma/\sqrt{n}}\)
  • Refer Get P(z) by pnorm() or z by qnorm(), to get the p-value from z-table
    • \(P_{\left(\overline{x}\right)} = P_{\left(z\right)}\)
  • Compare p-value with Level of significance \({\alpha}\)

31.17 The critical value approach requires that we first determine a value for the test statistic called the critical value.

Steps for the critical value approach

  • Calculate \(z\) for given \(\overline{x}\): \(z = \frac{\overline{x} - \mu_0}{\sigma/\sqrt{n}}\)
  • Using the z-table, find the z for given Level of significance \({\alpha} = 0.01\)
  • Compare test statistic with z-value i.e. \((z)\) vs. \((z_{\alpha = 0.01})\)
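The two approaches always lead to the same decision; a sketch with a hypothetical upper-tail test (observed z = 2.5, α = 0.01):

```r
# #p-value approach vs. critical value approach (hypothetical values)
alpha <- 0.01
z <- 2.5                                     # #observed test statistic
# #p-value approach: compare P(Z > z) with alpha
p <- pnorm(z, lower.tail = FALSE)
cat("p-value =", round(p, 4), "-> reject H0:", p < alpha, "\n")
# #Critical value approach: compare z with z_alpha
z_alpha <- qnorm(alpha, lower.tail = FALSE)
cat("z_alpha =", round(z_alpha, 4), "-> reject H0:", z > z_alpha, "\n")
## p-value = 0.0062 -> reject H0: TRUE 
## z_alpha = 2.3263 -> reject H0: TRUE
```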

4.20 z-test vs. t-test

If the population standard deviation \((\sigma)\) is known, apply the z-test. If it is unknown, apply the t-test. The t-test converges to the z-test with increasing sample size.

Question: Does the probability from t-table differ from the probability value from z-table

  • No; practically, for sample sizes greater than 30, there is no meaningful difference

It is assumed that \((\overline{x} - \mu)\) is normally distributed. However, the estimated Standard Error (SE) is not: the sample variance follows a (scaled) chi-squared distribution. Thus, \((\overline{x} - \mu)/SE\) becomes a ratio of a normal variable to the square root of a chi-squared variable, and this ratio follows the t-distribution. Hence, the test we apply is called the t-test.

# #For Degrees of Freedom = 10 (N=11)
# #Critical Value (z) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
dof <- 10L
#
# #Left Tail Test
round(qt(p = xxalpha, df = dof, lower.tail = TRUE), 4)
##     10%      5%    5/2%      1%    1/2% 
## -1.3722 -1.8125 -2.2281 -2.7638 -3.1693
#
# #Right Tail Test
round(qt(p = xxalpha, df = dof, lower.tail = FALSE), 4)
##    10%     5%   5/2%     1%   1/2% 
## 1.3722 1.8125 2.2281 2.7638 3.1693
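The convergence of the t critical values to the z critical value can be seen by increasing the degrees of freedom:

```r
# #t critical values approach the z critical value as df grows
alpha <- 0.05
round(qt(alpha, df = c(5, 30, 100, 1000), lower.tail = FALSE), 4)
## [1] 2.0150 1.6973 1.6602 1.6464
round(qnorm(alpha, lower.tail = FALSE), 4)   # #z limit
## [1] 1.6449
```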

4.21 t-test

4.21.1 Degrees of Freedom

30.5 The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. In general, the degrees of freedom of an estimate of a parameter are \((n - 1)\).

Why the degrees of freedom are \((n-1)\)

  • Degrees of freedom refer to the number of independent pieces of information that go into the computation. i.e. \(\{(x_{1}-\overline{x}), (x_{2}-\overline{x}), \ldots, (x_{n}-\overline{x})\}\)
  • However, \(\sum (x_{i}-\overline{x}) = 0\) for any data set.
  • Thus, only \((n − 1)\) of the \((x_{i}-\overline{x})\) values are independent.
    • if we know \((n − 1)\) of the values, the remaining value can be determined exactly by using the condition.
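The constraint \(\sum (x_{i}-\overline{x}) = 0\) is easy to verify, and shows how the last deviation is fixed by the others (made-up data):

```r
# #Deviations from the mean always sum to zero (made-up data)
xx <- c(4, 8, 6, 5, 7)
dev <- xx - mean(xx)
cat("Sum of deviations =", sum(dev), "\n")
## Sum of deviations = 0
# #Knowing the first n-1 deviations determines the last one
cat("Last deviation =", dev[5], "= -sum of the others =", -sum(dev[1:4]))
## Last deviation = 1 = -sum of the others = 1
```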

Question: Is there any minimum sample size we must consider before calculating degrees of freedom

  • Larger sample sizes are needed if the distribution of the population is highly skewed or includes outliers.

Guess: Degrees of freedom is also calculated to remove the possible bias

  • No

4.21.2 How to use t-table

  • Rows have degrees of freedom, Columns have \({\alpha}\) values, get the t-statistic at their intersection
    • For DOF = 10, and \({\alpha} = 0.05\), t-table has value 1.812 (Critical Limit)
    • In right tail test, if the test-statistic is greater than critical limit, we can reject the null

Validation


5 Statistics (B13, Oct-03)

5.1 Overview

  • “Introduction to Statistics”

5.2 Definitions

(C09P01) Type-I $(\alpha)$ and Type-II $(\beta)$ Errors

Figure 5.1 (C09P01) Type-I \((\alpha)\) and Type-II \((\beta)\) Errors

Refer Type I and Type II Errors (B12) & Type I and Type II Errors

31.7 The error of rejecting \({H_0}\) when it is true, is Type I error \(({\alpha})\).

31.8 The error of accepting \({H_0}\) when it is false, is Type II error \(({\beta})\).

31.9 The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.

30.3 The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.

31.28 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).

31.10 Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.

31.23 p-value Approach: Form Hypothesis | Specify \({\alpha}\) | Calculate test statistic | Calculate p-value | Compare p-value with \({\alpha}\) | Interpret

29.13 In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\). It is the indicator of ‘Sampling Fluctuation.’

5.3 Approaches

Population Size = 100, \({\alpha} = 0.05\)

Hypothesis: \(\text{\{Right Tail or Upper Tail\} } {H_0} : {\mu} \leq 22 \iff {H_a}: {\mu} > 22\)

Sample: n=4, dof = 3, \({\overline{x}} = 23\)

Sample: n=10, dof = 9, \({\overline{x}} = 23\)

We know that if we take another sample, we will have a different sample mean. So, we need to confirm whether the above calculated sample mean \({\overline{x}} = 23\) represents the population mean \({\mu}\), i.e. can we reject or fail to reject \({H_0}\) based on this sample?

3 Approaches for Hypothesis Testing -

  1. Test Statistic Approach
    • Fictitious values: Standard Error (SE) = 0.22, so \(t = \frac{23 - 22}{0.22} = 4.545\)
    • For (DOF = 3): \(P_{(t)} = {\alpha} = 0.05\), at \({}^{3}t_{\alpha} = 2.353\)
    • For (DOF = 9): \(P_{(t)} = {\alpha} = 0.05\), at \({}^{9}t_{\alpha} = 1.833\)
    • For both the cases, \({t}\) is greater than \({}^{dof}t_{\alpha}\)
    • Hence null is rejected, the ‘test is statistically significant’
  2. p-value approach
    • Fictitious values: Standard Error (SE) = 0.22, so \(t = \frac{23 - 22}{0.22} = 4.545\)
    • Get \({}^3\!P_{(t = 4.545)} = 0.00997\)
    • Get \({}^9\!P_{(t = 4.545)} = 0.000697\)
    • For both the cases, \(P_{(t)}\) is lower than \({\alpha}\)
    • Hence null is rejected, the ‘test is statistically significant’
  3. Confidence Interval Approach
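The Test Statistic and p-value approaches above can be reproduced in R with the same fictitious values (SE = 0.22):

```r
# #Fictitious values from the example above: SE = 0.22
t_stat <- (23 - 22) / 0.22                  # 4.545
# #Test Statistic Approach: upper-tail critical values at alpha = 0.05
qt(p = 0.05, df = 3, lower.tail = FALSE)    # 2.353
qt(p = 0.05, df = 9, lower.tail = FALSE)    # 1.833
# #p-value Approach: upper-tail probability of the observed t
pt(q = t_stat, df = 3, lower.tail = FALSE)  # ~0.00997
pt(q = t_stat, df = 9, lower.tail = FALSE)  # ~0.000697
```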

If the population standard deviation \(({\sigma})\) is known, apply the z-test. If it is unknown, apply the t-test. The t-test converges to the z-test as the sample size increases.
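The convergence of the t-test toward the z-test can be seen directly by comparing upper 5% critical values as the degrees of freedom grow:

```r
# #Upper 5% critical values: t approaches z as dof increases
qnorm(p = 0.95)                                    # ~1.645
sapply(c(3, 9, 30, 1000), function(dof) qt(p = 0.95, df = dof))
```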

2-T Rule of Thumb - Skipped “09:55”


Examples

Example:

  1. Question: If we get a z-value of 3.44 (Right Tail), What is the Probability \(P_{(z)}\)
    • For z = 3.44 & Left Tail, p-value = 0.999709 (by ‘pnorm(z)’)
    • For z = 3.44 & Right Tail, p-value = 0.000291 (by ‘1 - pnorm(z)’)
    • For z = 3.44 & Right Tail, p-value = 0.000291 (by ‘pnorm(z, lower.tail = FALSE)’)
  2. Question: If we get a z-value of 4.55 (Right Tail), What is the Probability \(P_{(z)}\)
    • For z = 4.55 & Right Tail, p-value = 0.00000268
  3. Question: If we get a z-value of 1.22 (Right Tail), would we reject the null at \({\alpha} = 0.05\)
    • For z = 1.22 & Right Tail, p-value = 0.11123
    • Because \(P_{(z)}\) is greater than the \({\alpha}\), we fail to reject the null, the ‘test is statistically NOT significant’
  4. Question: If we get a z-value of 1.99 (Right Tail), would we reject the null at \({\alpha} = 0.05\)
    • For z = 1.99 & Right Tail, p-value = 0.023295 ( = 1 - 0.9767)
    • Because \(P_{(z)}\) is lower than the \({\alpha}\), null is rejected, the ‘test is statistically significant’

Code

# #Get P(z)
z01 <- round(pnorm(3.44), digits = 6)
z02 <- 1 - round(pnorm(3.44), digits = 6)
z03 <- round(pnorm(3.44, lower.tail = FALSE), digits = 6)
z04 <- format(pnorm(4.55, lower.tail = FALSE), digits = 3, scientific = FALSE)
z05 <- format(pnorm(1.22, lower.tail = FALSE), digits = 5)
z06 <- format(pnorm(1.99, lower.tail = FALSE), digits = 5)
z07 <- format(pnorm(1.99, lower.tail = TRUE), digits = 5)

5.4 Flowchart

  • Tests |
    • Test of Means |
      • One Sample |
        • z-test (Population Standard Deviation \({\sigma}\), is known)
        • t-test (Population Standard Deviation \({\sigma}\), is unknown)
      • Two Sample |
      • More than Two Samples

5.5 Two Sample t-Test

32.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)

32.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)

32.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

Example:

32.6 Independent sample design: A simple random sample of workers is selected and each worker in the sample uses method 1. A second independent simple random sample of workers is selected and each worker in this sample uses method 2.

32.7 Matched sample design: One simple random sample of workers is selected. Each worker first uses one method and then uses the other method. The order of the two methods is assigned randomly to the workers, with some workers performing method 1 first and others performing method 2 first. Each worker provides a pair of data values, one value for method 1 and another value for method 2.

Test Statistic for Independent Sample t-Test Statistic is given by (32.9) as shown below

\[t = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)}} = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{\sqrt{\frac{{s}_1^2}{{n}_1} + \frac{{s}_2^2}{{n}_2}}}\]

The t-test is any statistical hypothesis test in which the test statistic follows a Student t-distribution under the null hypothesis.

A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known.
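The test statistic in (32.9) can be checked against t.test() on small made-up samples; with unequal variances this is the Welch form, which is R's default (var.equal = FALSE):

```r
# #Fictitious independent samples
set.seed(7)
x1 <- rnorm(12, mean = 24, sd = 2)
x2 <- rnorm(15, mean = 22, sd = 3)
# #Manual test statistic (32.9) with D0 = 0
t_manual <- (mean(x1) - mean(x2) - 0) /
  sqrt(var(x1) / length(x1) + var(x2) / length(x2))
# #Same statistic from t.test(); Welch is the default
t_r <- unname(t.test(x1, x2)$statistic)
all.equal(t_manual, t_r)   # TRUE
```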

Example: Suppose we want to evaluate the effect of a training program.

We can take two samples of 50 people each. First Set “Untrained” would be from the set of people who did not receive training. Second Set “Trained” would be from the set of people who have undergone the training. Comparison of these two sample mean performances would be done by “independent sample” t-test.

Or

We can take a sample of 50 “Untrained” people. Get their mean performance. Provide the training of these 50 people. Then again get their mean performance. Now, we have “paired” samples of performances of same people. One set has their performance before the training and another is after the training. Comparison of these two sample mean performances would be done by “paired sample” t-test.

Paired samples t-tests typically consist of a sample of matched pairs of similar units, or one group of units that has been tested twice (a “repeated measures” t-test).

The matched sample design is generally preferred to the independent sample design because the matched-sample procedure often improves the precision of the estimate.

  • Question: How do you conclude that the training was effective, or even that the training did not somehow make the situation worse? Assume 25 people performed better and 25 people performed worse (somehow!)
    • (Prof) In this situation we cannot claim that the training is effective.
      • We start with the null hypothesis that the two sample means are the same (Two Tail Test). If we are able to reject this, then we can perform an Upper or Lower Tail Test and again try to find a significant result.
    • If we cannot reject the null hypothesis, then we conclude that “we cannot claim that the training is effective.”
  • Example: If we want to comment on the performance of new employees compared to an ideal value of 95. Take a sample of 20 people and get their performance.
    • We would need to perform One Sample t-test
  • Example: If we want to comment on performance of Engineers and Non-engineers
    • Two Sample Independent t-test
  • Example: We want to check whether we have recruited a higher proportion of females than males.
    • Two Sample Proportion Test
  • Example: If we want to comment on their induction training program by conducting a test before and after the program
    • Two Sample Paired t-test
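A minimal sketch of the induction-training example above, with fabricated before/after scores for the same 10 people (all values are assumptions for illustration):

```r
# #Fabricated scores: training adds about 5 points on average
set.seed(42)
before <- rnorm(10, mean = 60, sd = 8)
after  <- before + rnorm(10, mean = 5, sd = 2)
# #Paired t-test: H0 mean difference <= 0 vs Ha > 0
res <- t.test(after, before, paired = TRUE, alternative = "greater")
res$p.value < 0.05   # TRUE here -> training effect is significant
```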

5.6 More than Two Samples

Assume there are 3 samples A, B, C. We can do \(C_2^3 = 3\) number of tests i.e. \(\{(A, B), (B, C), (C, A)\}\). However, assuming \({\alpha} = 0.05 \iff {\gamma} = 0.95\) for each test, the confidence for 3 consecutive tests becomes \({\gamma}^3 = 0.857 \iff {\alpha} = 0.143\), which is a very high and unacceptable error rate. To avoid this, we use ANOVA as a single test.
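The inflation of the overall error rate can be verified in one line:

```r
# #Familywise confidence for 3 tests, each at alpha = 0.05
gamma3 <- 0.95^3       # 0.857375
alpha3 <- 1 - gamma3   # 0.142625, far above the per-test 0.05
round(c(gamma3, alpha3), 3)
```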

A high value of the F statistic would indicate that the population means are different.

Validation


6 Statistics (B14, Oct-10)

6.1 Overview

Equality in Hypothesis

  • The equality part of the expression \(\{\mu \geq \mu_0 \, | \, \mu \leq \mu_0 \, | \, \mu = \mu_0\}\) always appears in the null hypothesis \({H_0}\).
    • We try to reject null, so that we can confidently accept the alternate. If the alternate is ambiguous e.g. “is greater than or equal to” then we will not be able to conclude with confidence.
  • Alternative hypothesis is often what the test is attempting to establish.
    • Hence, asking whether the user is looking for evidence to support \(\{\mu < \mu_0 \, | \, \mu > \mu_0 \, | \, \mu \neq \mu_0\}\) will help determine \({H_a}\)

31.14 If \({\sigma}\) is known, the standard normal random variable \({z}\) is used as test statistic to determine whether \({\overline{x}}\) deviates from the hypothesized value of \({\mu}\) enough to justify rejecting the null hypothesis. Refer equation (31.1) \(\to z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}}\)

31.24 If \({\sigma}\) is unknown, the sampling distribution of the test statistic follows the t distribution with \((n - 1)\) degrees of freedom. Refer equation (31.3) \(\to t = \frac{{\overline{x}} - {\mu}_0}{{s}/\sqrt{n}}\)

6.2 Example: WSES: Preprocessing

Please import the WSES data into the xxWSES object. Due to copyright, it has not been shared.

  • Assuming: Average Sales 8-million \(({\mu}_0 = 8)\), Standard Deviation 2-million \(({\sigma} = 2)\)
  • Hypothesis test to check whether the average sales value in the population is at least 8-million

31.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

Caution: While importing data, Mac users will probably find it easier to import the CSV file. However, I am not a Mac user so cannot comment on this.

Data

# #Import the Data and assign to a temporary variable for ease of use
xxWSES <- f_getRDS(xxWSES)
bb <- xxWSES
str(bb)
## spec_tbl_df [1,000 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Opportunity No.                 : num [1:1000] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Reporting Status                : chr [1:1000] "Lost" "Won" "Lost" "Won" ...
##  $ Sales Outcome                   : num [1:1000] 0 1 0 1 1 0 0 0 0 0 ...
##  $ Product                         : chr [1:1000] "LearnSys" "GTMSys" "GTMSys" "GTMSys" ...
##  $ Industry                        : chr [1:1000] "Banks" "Airline" "Capital Markets" "Insurance" ...
##  $ Region                          : chr [1:1000] "Africa" "UK" "UK" "UK" ...
##  $ Relative Strength in the segment: num [1:1000] 45 56 48 58 49 64 54 54 67 54 ...
##  $ Profit of Customer in Million   : num [1:1000] 2.11 0.79 1.62 0.09 1.46 0.94 1.32 1.09 1.3 1.32 ...
##  $ Sales Value in Million          : num [1:1000] 10.29 11.42 5.63 10.17 10.6 ...
##  $ Profit %                        : num [1:1000] 29 46 70 46 32 65 50 74 46 57 ...
##  $ WSES Proportion in Joint Bid    : num [1:1000] 66 50 50 56 54 52 68 48 56 63 ...
##  $ Leads Conversion Class          : chr [1:1000] "F" "E" "F" "E" ...

Rename Headers

# #List Column Headers
names(bb)
##  [1] "Opportunity No."                  "Reporting Status"                
##  [3] "Sales Outcome"                    "Product"                         
##  [5] "Industry"                         "Region"                          
##  [7] "Relative Strength in the segment" "Profit of Customer in Million"   
##  [9] "Sales Value in Million"           "Profit %"                        
## [11] "WSES Proportion in Joint Bid"     "Leads Conversion Class"
#
# #Rename Headers
bb_headers <- c("SN" , "RS" , "SO" , "PDT" , "INT" , "RG" , "RS1" , "PM" , "SVM" , "PP" , "JB" , "LCC")
names(bb) <- bb_headers
#
# #Verification
names(bb)
##  [1] "SN"  "RS"  "SO"  "PDT" "INT" "RG"  "RS1" "PM"  "SVM" "PP"  "JB"  "LCC"
#

Conversion to Factor

  • From the case study, it can be seen that multiple columns are categorical (factor) or ordinal (ordered factor)

  • Question: What is the importance of having this kind of order factor over simple factor

    • Here order is also important. Also in future, ordered factors will be needed for some analysis
    • (Aside) Refer Scales of Measurement
      • Simple factor (nominal) can provide only Mode whereas Ordered factor (ordinal) can provide Median also. Rank based statistical models can be applied on the ordinal data.
  • Question: If ‘RS’ is already integer 0 & 1, then why convert it to factor

    • While the data shows them as 0 & 1, they are actually NOT integers
    • If we code male and female as 0 & 1, these are still categorical
  • Question: Why LCC is NOT ordinal (Aside)

    • Unknown “ForLater”

Data

# #"Reporting Status i.e. RS" Converting "character" to "factor" and Label them
bb$RS <- factor(bb$RS, levels = c("Lost", "Won"), labels = c("0", "1"))
#
# #"Sales Outcome i.e. SO" Converting "numeric" to "factor" 
bb$SO <- factor(bb$SO)
#
# #"Product Vertical i.e. PDT" Ordinal
# #What are the unique values in this column
unique(bb$PDT)
## [1] "LearnSys"   "GTMSys"     "Lifesys"    "Finsys"     "Procsys"    "Logissys"   "ContactSys"
#
# #Converting "character" to "Ordered factor"
# #Note: If level order is not provided, by default, alphabetical ordering will be assigned.
levels(factor(bb$PDT, ordered = TRUE))
## [1] "ContactSys" "Finsys"     "GTMSys"     "LearnSys"   "Lifesys"    "Logissys"   "Procsys"
#
# #Provide ordering of factor levels in Ascending Order.
bb$PDT <- factor(bb$PDT, ordered = TRUE, 
      levels = c("GTMSys", "Procsys", "LearnSys", "Finsys", "Lifesys", "Logissys", "ContactSys"))
#
# #"Industry i.e. INT" Ordinal
bb$INT <- factor(bb$INT, ordered = TRUE, 
      levels = c("Capital Markets", "Banks", "Defense", "Consumer goods", "Others", "Security", 
        "Energy", "Insurance", "Airline", "Finance", "Infrastructure", "Mobility", "Other Govt.", 
        "Govt.", "Telecom equipments", "Health", "Clinical research", "Agriculture"))
#
# #"Region i.e. RG" Ordinal
bb$RG <- factor(bb$RG, ordered = TRUE, levels = c("UK", "Other Europe", "Americas", "Africa",
                                                "India", "Japan", "Singapore", "Spain", "Canada"))
#
# #"Leads Conversion Class i.e. LCC" Ordinal, However we are going with Nominal here.
bb$LCC <- factor(bb$LCC, levels = c("E", "V", "F", "L"), labels = c(1, 2, 3, 4))

Structure after Conversion

str(bb)
## spec_tbl_df [1,000 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ SN : num [1:1000] 1 2 3 4 5 6 7 8 9 10 ...
##  $ RS : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 1 1 1 ...
##  $ SO : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 1 1 1 ...
##  $ PDT: Ord.factor w/ 7 levels "GTMSys"<"Procsys"<..: 3 1 1 1 1 5 4 4 1 4 ...
##  $ INT: Ord.factor w/ 18 levels "Capital Markets"<..: 2 9 1 8 5 5 4 4 16 4 ...
##  $ RG : Ord.factor w/ 9 levels "UK"<"Other Europe"<..: 4 1 1 1 1 2 1 1 1 1 ...
##  $ RS1: num [1:1000] 45 56 48 58 49 64 54 54 67 54 ...
##  $ PM : num [1:1000] 2.11 0.79 1.62 0.09 1.46 0.94 1.32 1.09 1.3 1.32 ...
##  $ SVM: num [1:1000] 10.29 11.42 5.63 10.17 10.6 ...
##  $ PP : num [1:1000] 29 46 70 46 32 65 50 74 46 57 ...
##  $ JB : num [1:1000] 66 50 50 56 54 52 68 48 56 63 ...
##  $ LCC: Factor w/ 4 levels "1","2","3","4": 3 1 3 1 2 3 2 2 4 4 ...

factor()

  • factor()
    • If level order is not provided, by default, alphabetical ordering will be assigned.
    • levels are the input, labels are the output in the factor() function.
    • A factor has only a levels attribute, which is set by the labels argument in the factor() function.
    • different levels are coded as (“E”, “V”, “F”, “L”)
      • for which you want the levels to be labeled as c(1, 2, 3, 4).
      • The factor() function will look for the values (“E”, “V”, “F”, “L”), convert them to internal numerical codes, and add the label values to the ‘levels’ attribute of the factor.
      • This attribute is used to convert the internal numerical values to the correct labels.
      • However, there is no ‘label’ attribute.
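A small illustration of the levels/labels behaviour described above:

```r
# #levels are matched in the input; labels become the stored levels
x <- factor(c("E", "V", "F", "L", "E"),
            levels = c("E", "V", "F", "L"), labels = c(1, 2, 3, 4))
levels(x)      # "1" "2" "3" "4" -- there is no separate 'label' attribute
as.integer(x)  # 1 2 3 4 1 -- the internal numerical codes
names(attributes(x))   # only 'levels' and 'class'
```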

6.2.1 Conversion to Numeric

# #If there are "character" columns which should be "numeric"
bb$RS1 <- as.numeric(bb$RS1)
bb$PP <- as.numeric(bb$PP)
bb$JB <- as.numeric(bb$JB)
# #Equivalent
bb <- bb %>% mutate(across(c(RS1, PP, JB), as.numeric))

6.3 WSES: Analysis

6.3.1 Q1

Assume average sales of 8-million dollars and population standard deviation to be 2-million dollars. Check whether the average sales value in the population is at least 8 million dollars.

31.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

  • Mean of “Sales Value in Million i.e. SVM”
    • \({\overline{x}} = 8.0442\)
      • mean(bb$SVM) \(\#\mathcal{R}\)
  • \(n = 1000, {\mu}_0 = 8, {\sigma} = 2\)
  • \(z = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}} = 0.6988\)
    • {8.0442 - 8} / {2 / sqrt(1000)} \(\#\mathcal{R}\)
    • {mean(bb$SVM) - 8} / {2 / sqrt(1000)} \(\#\mathcal{R}\)
  • \({}^U\!P_{z = 0.6988} = 0.2423\)
    • pnorm(q = 0.6988, lower.tail = FALSE) \(\#\mathcal{R}\)
    • 1 - pnorm(q = 0.6988, lower.tail = TRUE) \(\#\mathcal{R}\)
    • 1 - pnorm(q = 0.6988) \(\#\mathcal{R}\)
    • pnorm(q = -0.6988) \(\#\mathcal{R}\)
    • Caution: By default, pnorm() provides the probability to the left of z-value
  • Compare with \({\alpha} = 0.05\)
    • \({}^U\!P_{(z)} > {\alpha} \to {H_0}\) cannot be rejected
    • The sample results do not provide sufficient evidence to conclude that the average sales is more than 8-million
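The steps above can be collected into one runnable chunk. The sample mean is taken from the value reported above, since the raw WSES data is not shared:

```r
# #Right-tail z-test with sigma known (values from the example above)
xbar <- 8.0442; mu0 <- 8; sigma <- 2; n <- 1000
z <- (xbar - mu0) / (sigma / sqrt(n))      # ~0.6989
p_val <- pnorm(q = z, lower.tail = FALSE)  # ~0.2423
p_val > 0.05                               # TRUE -> fail to reject H0
```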

6.3.2 Q2

If the population standard deviation is unknown, evaluate the same hypothesis again.

  • \(t = \frac{\overline{x} - {\mu}_0}{{s}/\sqrt{n}} = 0.7049\)
    • {mean(bb$SVM) - 8} / {sd(bb$SVM) / sqrt(1000)} \(\#\mathcal{R}\)
  • \({}^U\!P_{t = 0.7049} = 0.2405\)
    • pt(q = 0.7049, df = nrow(bb) - 1, lower.tail = FALSE) \(\#\mathcal{R}\)
  • t.test()
    • t.test(bb$SVM, mu = 8, alternative = "greater") \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^U\!P_{(t)} > {\alpha} \to {H_0}\) cannot be rejected
    • The sample results do not provide sufficient evidence to conclude that the average sales is more than 8-million (same as earlier)
  • Question: “alternative hypothesis: true mean is greater than 8” What is the meaning of this line in the output of t.test
    • We provided this information.
    • (Aside) This line is providing the Alternate Hypothesis for reference. \({\mu}_0 = 8\) and we are performing an upper tail test thus the Alternate Hypothesis is “true mean is greater than 8”
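With \({\sigma}\) unknown, the same check in one chunk, reusing the t value reported above (the raw data is not shared):

```r
# #Right-tail t-test with sigma unknown; t taken from the example above
t_stat <- 0.7049
p_val <- pt(q = t_stat, df = 1000 - 1, lower.tail = FALSE)  # ~0.2405
p_val > 0.05                              # TRUE -> fail to reject H0
```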

6.3.3 Q3

Check whether the proportion of leads won by WSES is more than 50%.

31.26 \(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)

  • Count of Success \(({x})\) is Winning leads in the “Sales Outcome i.e. SO”
  • \({p}_0 = 0.50\)
    • (31.4) \({\sigma}_{\overline{p}} = \sqrt{\frac{{p}_0 (1 - {p}_0)}{n}} = 0.0158\)
      • sqrt(0.50 * {1 - 0.50} / 1000) \(\#\mathcal{R}\)
  • \(\{n = 1000, x = 481\} \to {\overline{p}} = {x}/{n} = 0.481\)
  • (31.5) \(z = \frac{{\overline{p}} - {p}_0}{{\sigma}_{\overline{p}}} = -1.2016\)
    • {0.481 - 0.50}/{sqrt(0.50 * {1 - 0.50} / 1000)} \(\#\mathcal{R}\)
  • \({}^U\!P_{z = -1.2016} = 0.8852\)
    • pnorm(q = -1.2016, lower.tail = FALSE) \(\#\mathcal{R}\)
    • 1 - pnorm(q = -1.2016) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^U\!P_{(z)} > {\alpha} \to {H_0}\) cannot be rejected
    • The sample results do not provide sufficient evidence to conclude that the company wins more than 50% of leads
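The same conclusion can also be reached with prop.test(), which uses a chi-square statistic with continuity correction, so its p-value differs slightly from the hand z-calculation:

```r
# #One-sample proportion test: H0 p <= 0.50 vs Ha p > 0.50
res <- prop.test(x = 481, n = 1000, p = 0.50, alternative = "greater")
res$p.value > 0.05   # TRUE -> fail to reject H0
```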

Code

# #Proportions
bb %>% group_by(SO) %>% summarise(PCT = n() / nrow(.))
## # A tibble: 2 x 2
##   SO      PCT
##   <fct> <dbl>
## 1 0     0.519
## 2 1     0.481
#
pnorm(q = -1.2016, lower.tail = FALSE)
## [1] 0.8852407

Percentage

# #Grouped Percentages
# #table() gives a count
table(bb$SO)
## 
##   0   1 
## 519 481
#
# #prop.table() can work only with numbers so it needs table()
prop.table(table(bb$SO))
## 
##     0     1 
## 0.519 0.481
#
# #Similar
bb %>% group_by(SO) %>% summarise(N = n()) %>% mutate(PCT = N / sum(N))
## # A tibble: 2 x 3
##   SO        N   PCT
##   <fct> <int> <dbl>
## 1 0       519 0.519
## 2 1       481 0.481
bb %>% group_by(SO) %>% summarise(PCT = n() / nrow(.))
## # A tibble: 2 x 2
##   SO      PCT
##   <fct> <dbl>
## 1 0     0.519
## 2 1     0.481

6.3.4 Q4

Check whether the probability of winning a sales lead for the product “learnsys” is more than that of “Finsys.”

32.10 \(\text{\{Right or Upper\} } {H_0} : {p}_1 - {p}_2 \leq 0 \iff {H_a}: {p}_1 - {p}_2 > 0\)

  • Count of Success \(({x})\) is Winning leads in the “Sales Outcome i.e. SO”
  • (1: LearnSys) \(\{{n}_1 = 55 + 71 = 126, {x}_1 = 71\} \to {\overline{p}}_1 = {x}_1/{n}_1 = 0.563\)
  • (2: Finsys) \(\{{n}_2 = 83 + 34 = 117, {x}_2 = 34\} \to {\overline{p}}_2 = {x}_2/{n}_2 = 0.291\)
  • \({}^U\!P_{z} < {\alpha} \to {H_0}\) is rejected i.e. the proportions are different
    • We can conclude that the “learnsys” performs better than “Finsys”
  • Question: When we have filtered out the data, why does the table still show 0 values
    • We can filter the rows out; however, the factor levels remain in memory
    • (Aside) Factor levels are not dropped, by default, when you filter them. Use factor() again to drop the unused levels.

Test

# #Data | Subset | Filter | Update Factor levels
ii <- bb %>% select(PDT, SO, SVM) %>% 
  filter(PDT %in% c("LearnSys", "Finsys")) %>% mutate(across(PDT, factor))
str(ii)
## tibble [243 x 3] (S3: tbl_df/tbl/data.frame)
##  $ PDT: Ord.factor w/ 2 levels "LearnSys"<"Finsys": 1 2 2 2 1 2 2 1 2 2 ...
##  $ SO : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
##  $ SVM: num [1:243] 10.29 5.25 7.3 9.69 6.82 ...
#
# #Count
table(ii$PDT, ii$SO)
##           
##             0  1
##   LearnSys 55 71
##   Finsys   83 34
#
# #Proportion Table
round(prop.table(table(ii$PDT, ii$SO), margin = 1), 3)
##           
##                0     1
##   LearnSys 0.437 0.563
##   Finsys   0.709 0.291
#
prop.test(x = c(71, 34), n = c(126, 117), alternative = "greater")
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(71, 34) out of c(126, 117)
## X-squared = 17.316, df = 1, p-value = 0.00001583
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.1644089 1.0000000
## sample estimates:
##    prop 1    prop 2 
## 0.5634921 0.2905983

prop.table()

ii <- bb %>% select(PDT, SO, SVM) %>% 
  filter(PDT %in% c("LearnSys", "Finsys")) %>% mutate(across(PDT, factor))
str(ii)
## tibble [243 x 3] (S3: tbl_df/tbl/data.frame)
##  $ PDT: Ord.factor w/ 2 levels "LearnSys"<"Finsys": 1 2 2 2 1 2 2 1 2 2 ...
##  $ SO : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
##  $ SVM: num [1:243] 10.29 5.25 7.3 9.69 6.82 ...
#
# #Proportion Table: margin gives the margin to split by i.e. 
# #1 means rowwise sum, 2 means columnwise
round(prop.table(table(ii$PDT, ii$SO), margin = 1), 3)
##           
##                0     1
##   LearnSys 0.437 0.563
##   Finsys   0.709 0.291
round(prop.table(table(ii$PDT, ii$SO), margin = 2), 3)
##           
##                0     1
##   LearnSys 0.399 0.676
##   Finsys   0.601 0.324
#
# #Similar
ii %>% select(PDT, SO) %>% 
  count(PDT, SO) %>% pivot_wider(names_from = SO, values_from = n) %>% 
  mutate(SUM = rowSums(across(where(is.numeric)))) %>% 
  mutate(across(where(is.numeric), ~ round(. * 100 /SUM, 1))) 
## # A tibble: 2 x 4
##   PDT        `0`   `1`   SUM
##   <ord>    <dbl> <dbl> <dbl>
## 1 LearnSys  43.7  56.3   100
## 2 Finsys    70.9  29.1   100

6.3.5 Q5

Check whether the average sales value of “learnsys” projects is higher than that of “Finsys” projects. (\({\alpha} = 0.05\))

32.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)

  • (Pooled) t-test for difference of means can be applied only if the variances are the same.
    • The locations of two distributions can be compared only when their spreads are similar.
    • (Aside) If the variances are not the same, then the Welch Test is applied in place of the Pooled Test.
  • \({}^U\!P_{t} > {\alpha} \to {H_0}\) cannot be rejected
    • The sample results do not provide sufficient evidence to conclude that the “learnsys” has higher average sales than “Finsys”
  • Question: What is ‘SVM ~ PDT’
    • It is the formula notation for t-test (in R) in the form of ‘Continuous ~ Categorical’

“ForLater”

  • Question: How to decide whether to keep the Lost bids or exclude them
    • Including lost bids: \(\text{(DOF)} = 241, t = 0.93503, P_{(t)} = 0.1754 \to {H_0}\) cannot be rejected
    • Excluding lost bids: \(\text{(DOF)} = 103, t = 0.62152, P_{(t)} = 0.2678 \to {H_0}\) cannot be rejected
    • Thus, there is no change in the hypothesis test results, but which process should be utilised?
    • A similar question can be raised for Q6, i.e., should we consider the profit of those bids which we lost?
str(ii)
## tibble [243 x 3] (S3: tbl_df/tbl/data.frame)
##  $ PDT: Ord.factor w/ 2 levels "LearnSys"<"Finsys": 1 2 2 2 1 2 2 1 2 2 ...
##  $ SO : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
##  $ SVM: num [1:243] 10.29 5.25 7.3 9.69 6.82 ...
ii_var <- var.test(SVM ~ PDT, data = ii)
if(ii_var$p.value > 0.05) {
  cat(paste0("Variances are same. Pooled Test can be applied. \n"))
  t.test(formula = SVM ~ PDT, data = ii, alternative = "greater", var.equal = TRUE)
} else {
  cat(paste0("Variances are NOT same. Welch Test should be applied. \n"))
  t.test(formula = SVM ~ PDT, data = ii, alternative = "greater", var.equal = FALSE)
}
## Variances are same. Pooled Test can be applied.
## 
##  Two Sample t-test
## 
## data:  SVM by PDT
## t = 0.93503, df = 241, p-value = 0.1754
## alternative hypothesis: true difference in means between group LearnSys and group Finsys is greater than 0
## 95 percent confidence interval:
##  -0.1728648        Inf
## sample estimates:
## mean in group LearnSys   mean in group Finsys 
##               8.030476               7.804786
jj <- ii %>% filter(SO == "1")
jj_var <- var.test(SVM ~ PDT, data = jj)
if(jj_var$p.value > 0.05) {
  t.test(formula = SVM ~ PDT, data = jj, alternative = "greater", var.equal = TRUE)
} else {
  cat(paste0("Problem: Difference of means can be tested only if the variances are same.\n"))
}
## 
##  Two Sample t-test
## 
## data:  SVM by PDT
## t = 0.62152, df = 103, p-value = 0.2678
## alternative hypothesis: true difference in means between group LearnSys and group Finsys is greater than 0
## 95 percent confidence interval:
##  -0.3949429        Inf
## sample estimates:
## mean in group LearnSys   mean in group Finsys 
##               7.947887               7.711471

6.3.6 Q6

Check whether there is a difference in the average profit across the geographical locations: United Kingdom, India and the Americas.

35.2 \(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)

  • We need to conduct ANOVA because we are comparing more than 2 groups
  • aov()
    • Total Variance = Between or MSTR + Within or MSE
      • Sum Sq provides SSTR & SSE
      • Mean Sq provides MSTR (Between) & MSE (Within)
    • First Line (Column): \(\text{DOF}_{(k-1)} = 2, \text{SSTR} = 297, \text{MSTR} = 148.7\)
    • Residuals (Within) : \(\text{DOF}_{(n-k)} = 689, \text{SSE} = 68075, \text{MSE} = 98.8\)
    • (35.10) \(F = \frac{\text{MSTR}}{\text{MSE}} = 1.5\)
    • \({}^U\!P_{F = 1.5} = 0.2238\)
      • pf(q = 1.5, df1 = 2, df2 = 689, lower.tail = FALSE) \(\#\mathcal{R}\)
    • Compare with \({\alpha} = 0.05\)
      • \({}^U\!P_{(F)} > {\alpha} \to {H_0}\) cannot be rejected
      • The sample results do not provide sufficient evidence to conclude that the average profit differs across the 3 geographical locations.
  • Question: The sample sizes of these 3 groups are different (UK 553, Americas 104, India 35). Would this impact our analysis
    • If the sample sizes are similar, the power will be higher, but otherwise ANOVA is not very sensitive to unequal sample sizes. We can safely use it.
  • Warning:
    • “Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, …)”
    • Fix: remove the ‘type’ option from the aov() call
ii <- bb %>% filter(RG %in% c("UK", "India", "Americas")) %>% select(RG, PP, SO)
str(ii)
## tibble [692 x 3] (S3: tbl_df/tbl/data.frame)
##  $ RG: Ord.factor w/ 9 levels "UK"<"Other Europe"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ PP: num [1:692] 46 70 46 32 50 74 46 57 60 58 ...
##  $ SO: Factor w/ 2 levels "0","1": 2 1 2 2 1 1 1 1 1 1 ...
#
# #ANOVA
ii_aov <- aov(formula = PP ~ RG, data = ii)
#
# #
model.tables(ii_aov, type = "means")
## Tables of means
## Grand mean
##         
## 50.7052 
## 
##  RG 
##        UK Americas India
##      50.6    50.33 53.51
## rep 553.0   104.00 35.00
#
# #Summary
summary(ii_aov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## RG            2    297   148.7   1.505  0.223
## Residuals   689  68075    98.8
jj <- ii %>% filter(SO == "1")
str(jj)
## tibble [339 x 3] (S3: tbl_df/tbl/data.frame)
##  $ RG: Ord.factor w/ 9 levels "UK"<"Other Europe"<..: 1 1 1 3 1 1 1 1 1 1 ...
##  $ PP: num [1:339] 46 46 32 42 40 49 42 50 38 56 ...
##  $ SO: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
#
# #ANOVA
jj_aov <- aov(formula = PP ~ RG, data = jj)
#
# #
model.tables(jj_aov, type = "means")
## Tables of means
## Grand mean
##          
## 46.37168 
## 
##  RG 
##         UK Americas India
##      46.23    46.36 48.65
## rep 267.00    55.00 17.00
#
# #Summary
summary(jj_aov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## RG            2     93   46.75   0.532  0.588
## Residuals   336  29508   87.82

6.3.7 Q7

Check whether the sales conversions are different for different geographical locations.

  • Both (Location and Sales Conversions) are categorical variables, so a ChiSq Test is required
  • \(P_{\chi^2} < {\alpha} \to {H_0}\) is rejected i.e. proportions are different
    • It can be claimed that there is an association between these variables.
    • We can conclude that the sales conversions are different for different geographical locations
  • Warning:
    • “Warning in chisq.test(): Chi-squared approximation may be incorrect”
    • Some of the observed frequencies in the table are too small, e.g. Canada
ii <- bb %>% select(RG, SO)
str(ii)
## tibble [1,000 x 2] (S3: tbl_df/tbl/data.frame)
##  $ RG: Ord.factor w/ 9 levels "UK"<"Other Europe"<..: 4 1 1 1 1 2 1 1 1 1 ...
##  $ SO: Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 1 1 1 ...
#
round(prop.table(table(ii$RG, ii$SO), margin = 1), 3)
##               
##                    0     1
##   UK           0.517 0.483
##   Other Europe 0.582 0.418
##   Americas     0.471 0.529
##   Africa       0.409 0.591
##   India        0.514 0.486
##   Japan        0.375 0.625
##   Singapore    0.739 0.261
##   Spain        0.917 0.083
##   Canada       0.333 0.667
#
# #Chi-Sq Test
tryCatch(chisq.test(table(ii$RG, ii$SO)), 
         warning = function(w) {
           print(paste0(w))
           suppressWarnings(chisq.test(table(ii$RG, ii$SO)))
           })
## [1] "simpleWarning in chisq.test(table(ii$RG, ii$SO)): Chi-squared approximation may be incorrect\n"
## 
##  Pearson's Chi-squared test
## 
## data:  table(ii$RG, ii$SO)
## X-squared = 22.263, df = 8, p-value = 0.004452
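The ‘Chi-squared approximation may be incorrect’ warning above can be sidestepped with a Monte Carlo p-value (simulate.p.value = TRUE). Since the course data is not shared, the sketch below uses a fictitious table with one sparse row (like Canada):

```r
# #Fictitious contingency table with one sparse row
tbl <- matrix(c(50, 48, 30, 41, 2, 4), nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("0", "1")))
# #Monte Carlo p-value avoids the chi-square approximation (and its warning)
set.seed(1)
res <- chisq.test(tbl, simulate.p.value = TRUE, B = 2000)
res$method   # "... with simulated p-value (based on 2000 replicates)"
```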

6.3.8 Q8

  • Check whether the sales conversions depend on the sales value. Check this claim by making 3 groups of Sales Value: <6-million, [6, 8]-million, and >8-million dollars

  • Both are categorical variables, so a ChiSq Test is required

  • \(P_{\chi^2} > {\alpha} \to {H_0}\) cannot be rejected.

    • It cannot be claimed that there is an association between these variables.
    • The sample results do not provide sufficient evidence to conclude that the sales conversions depend on the sales value.
ii <- bb %>% select(SO, SVM)
#
# #Create 3 Groups with middle group inclusive of both 6 & 8 
ii$RSVM <- cut(ii$SVM, breaks = c(0, 5.9999, 8, 15), labels = 1:3)
#
summary(ii)
##  SO           SVM         RSVM   
##  0:519   Min.   : 1.640   1:155  
##  1:481   1st Qu.: 6.690   2:346  
##          Median : 8.000   3:499  
##          Mean   : 8.044          
##          3rd Qu.: 9.440          
##          Max.   :14.230
#
table(ii$RSVM, ii$SO)
##    
##       0   1
##   1  86  69
##   2 182 164
##   3 251 248
#
# #Chi-Sq Test
chisq.test(table(ii$RSVM, ii$SO))
## 
##  Pearson's Chi-squared test
## 
## data:  table(ii$RSVM, ii$SO)
## X-squared = 1.377, df = 2, p-value = 0.5023

Validation


7 Quiz (B15, Oct-17)

7.1 Overview

  • This covers a short quiz and a case study Case Study: JAT
    • Case analysis done on Nov-07 has been merged here.

7.2 Short Quiz

  1. In hypothesis testing,
    1. the smaller the Type I error, the smaller the Type II error will be
    2. the smaller the Type I error, the larger the Type II error will be
    3. Type II error will not be affected by Type I error
    4. the sum of Type I and Type II errors must equal 1
    • Answer: b
  2. What type of error occurs if you accept \({H_0}\) when, in fact, it is not true
    1. Type II
    2. Type I
    3. either Type I or Type II, depending on the level of significance
    4. either Type I or Type II, depending on whether the test is one tail or two tail
    • Answer: a
  3. If the level of significance of a hypothesis test is raised from .01 to .05, the probability of a Type II error
    1. will also increase from .01 to .05
    2. will not change
    3. will decrease
    4. will increase
    • Answer: c
  4. The sum of the values of \({\alpha}\) and \({\beta}\)
    1. always add up to 1.0
    2. always add up to 0.5
    3. is the probability of Type II error
    4. None of these alternatives is correct
    • Answer: d
  5. Following the p-value approach, the null hypothesis is rejected if
    1. p-value less than or equal to \({\alpha}\)
    2. \({\alpha}\) < p-value
    3. p-value > \({\alpha}\)
    4. p-value = 1 - \({\alpha}\)
    • Answer: a
  6. The average manufacturing work week in metropolitan Chattanooga was 40.1 hours last year. It is believed that the recession has led to a reduction in the average work week. To test the validity of this belief, which of the following hypothesis formulations is correct?
    1. \({H_0} : {\mu} < 40.1 \quad {H_a} : {\mu} \geq 40.1\)
    2. \({H_0} : {\mu} \geq 40.1 \quad {H_a} : {\mu} < 40.1\)
    3. \({H_0} : {\mu} > 40.1 \quad {H_a} : {\mu} \leq 40.1\)
    4. \({H_0} : {\mu} = 40.1 \quad {H_a} : {\mu} \neq 40.1\)
    • Answer: b
  7. A statement about a population developed for the purpose of testing is called:
    1. Hypothesis
    2. Hypothesis testing
    3. Level of significance
    4. Test-statistic
    • Answer: a
  8. The probability of rejecting the null hypothesis when it is true is called:
    1. Level of confidence
    2. Level of significance
    3. Power of the test
    4. Difficult to tell
    • Answer: b
  9. The dividing point between the region where the null hypothesis is rejected and the region where it is not rejected is said to be:
    1. Critical region
    2. Critical value
    3. Acceptance region
    4. Significant region
    • Answer: b
  10. If the critical region is located equally in both sides of the sampling distribution of test-statistic, the test is called:
    1. One tailed
    2. Two tailed
    3. Right tailed
    4. Left tailed
    • Answer: b
  11. Test of hypothesis \({H_0}\): \({\mu} \leq 50\) against \({H_a}\): \({\mu} > 50\) leads to:
    1. Left-tailed test
    2. Right-tailed test
    3. Two-tailed test
    4. Difficult to tell
    • Answer: b
  12. Test of hypothesis \({H_0}\): \({\mu} = 20\) against \({H_1}\): \({\mu} < 20\) leads to:
    1. Right one-sided test
    2. Left one-sided test
    3. Two-sided test
    4. All of the above
    • Answer: b
  13. A failed student has been promoted by an examiner; it is an example of:
    1. Type-I error
    2. Type-II error
    3. Unbiased decision
    4. Difficult to tell
    • Answer: b
  14. The probability of accepting \({H_0}\) when it is True is called:
    1. Power of the test
    2. Size of the test
    3. Level of confidence
    4. Confidence coefficient
    • Answer: d
  15. Power of a test is directly related to:
    1. Type-I error
    2. Type-II error
    3. Both (a) and (b)
    4. Neither (a) nor (b)
    • Answer: b
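Quiz Q3 and Q15 (raising \({\alpha}\) lowers \({\beta}\); power \(= 1 - {\beta}\) is tied to Type-II error) can be verified numerically. A small sketch with assumed values — true mean under \({H_a}\) of 0.5, n = 25, \({\sigma}\) = 1, upper-tail z-test — none of which come from the quiz itself:

```r
# #Assumed values: true mean under Ha is 0.5, n = 25, sigma = 1, upper-tail z-test
n <- 25; mu_a <- 0.5; sigma <- 1
se <- sigma / sqrt(n)
beta_for <- function(alpha) {
  crit <- qnorm(1 - alpha, mean = 0, sd = se)  # #rejection cutoff under H0
  pnorm(crit, mean = mu_a, sd = se)            # #P(fail to reject | Ha) = beta
}
# #Raising alpha from 0.01 to 0.05 lowers beta, i.e. raises power = 1 - beta
c(beta_0.01 = beta_for(0.01), beta_0.05 = beta_for(0.05))
```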

7.3 Case: JAT

Please import the Jayalaxmi data

# #Object Names for each sheet
namesJ <- c("xxJdata", "xxJbela", "xxJdhar", "xxJdiseases")
# #Dimensions of these datasets
str(lapply(namesJ, function(x) {dim(eval(parse(text = x)))}))
## List of 4
##  $ : int [1:2] 123 26
##  $ : int [1:2] 24 14
##  $ : int [1:2] 22 14
##  $ : int [1:2] 6 4

7.3.2 Q1

Test the claim that disease 6 (leaf curl) information was accessed at least 60 times every month on average since October 2017 due to this disease outbreak. \(({\alpha} = 0.05)\)

NOTE: Actually the claim is “at least 60 times every week.” Month is a printing error.

31.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

  • \({}^U\!P_{t = 2.341} = 0.01329 \to {}^U\!P_{(z)} < {\alpha} \to {H_0}\) is rejected
    • We can conclude that the information on disease 6 was accessed at least 60 times every Week on average since October 2017. The claim is correct.
    • The claim was originally tested on Monthly data, which will trivially hold True if the Weekly rate is at least 60
      • \({}^U\!P_{t = 5.3771} = 0.0005 \to {}^U\!P_{(z)} < {\alpha} \to {H_0}\) is rejected
  • R Notes: For ease during subsetting, convert the Date Column to Date Type
    • as.Date(), as_date()
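A minimal illustration of that conversion (hypothetical POSIXct value, not the JAT column): once a column is of class Date, comparisons against "YYYY-MM-DD" strings in filter() behave as expected.

```r
library(lubridate)
# #Hypothetical date-time value standing in for the Month-Year column
x <- as.POSIXct("2017-10-01 12:30:00", tz = "UTC")
d <- as_date(x)          # #lubridate; base equivalent: as.Date(x, tz = "UTC")
class(d)                 # #"Date"
d >= "2017-10-01"        # #TRUE: the string is coerced to Date for the comparison
```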

Code (per Week Claim)

# #Data | Rename | Change from Date Time to Date
bb <- xxJdata %>% 
  rename(Dates = "Month-Year") %>% 
  #group_by(Dates) %>% 
  #summarise(D6 = sum(D6)) %>% 
  mutate(across(Dates, as_date)) 
# 
# #Get relevant rows using filter() #xxJdata[95:123, ]
ii <- bb %>% filter(Dates >= "2017-10-01")
#
t_ii <- {mean(ii$D6) - 60} / {sd(ii$D6) / sqrt(nrow(ii))}
print(t_ii)
## [1] 2.341004
pt(t_ii, df = nrow(ii) - 1, lower.tail = FALSE)
## [1] 0.01329037
#
# #One Sample t-Test
# #NOTE: conf.level is a confidence level, so it should be 0.95 (not alpha = 0.05);
# #it only changes the reported interval, not the t-statistic or the p-value
t.test(ii$D6, mu = 60, alternative = "greater", conf.level = 0.05)
## 
##  One Sample t-test
## 
## data:  ii$D6
## t = 2.341, df = 28, p-value = 0.01329
## alternative hypothesis: true mean is greater than 60
## 5 percent confidence interval:
##  74.52782      Inf
## sample estimates:
## mean of x 
##  68.41379

Code (per Month Claim)

# #Data | Rename | Sum Months D6 | Change from Date Time to Date
bb <- xxJdata %>% 
  rename(Dates = "Month-Year") %>% 
  group_by(Dates) %>% 
  summarise(D6 = sum(D6)) %>% 
  mutate(across(Dates, as_date)) 
# 
# #There are missing months, but those months are not applicable in this question
# #Get relevant rows using filter()
ii <- bb %>% filter(Dates >= "2017-10-01")
#
t_ii <- {mean(ii$D6) - 60} / {sd(ii$D6) / sqrt(nrow(ii))}
print(t_ii)
## [1] 5.377075
pt(t_ii, df = nrow(ii) - 1, lower.tail = FALSE)
## [1] 0.000516808
#
# #One Sample t-Test
# #NOTE: conf.level is a confidence level, so it should be 0.95 (not alpha = 0.05);
# #it only changes the reported interval, not the t-statistic or the p-value
t.test(ii$D6, mu = 60, alternative = "greater", conf.level = 0.05)
## 
##  One Sample t-test
## 
## data:  ii$D6
## t = 5.3771, df = 7, p-value = 0.0005168
## alternative hypothesis: true mean is greater than 60
## 5 percent confidence interval:
##  314.2406      Inf
## sample estimates:
## mean of x 
##       248

Missing Months

str(bb)
## tibble [33 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Dates: Date[1:33], format: "2015-06-01" "2015-07-01" "2015-08-01" ...
##  $ D6   : num [1:33] 0 6 31 41 74 104 91 88 74 88 ...
summary(bb)
##      Dates                  D6       
##  Min.   :2015-06-01   Min.   :  0.0  
##  1st Qu.:2016-02-01   1st Qu.: 71.0  
##  Median :2016-10-01   Median : 91.0  
##  Mean   :2016-10-25   Mean   :130.2  
##  3rd Qu.:2017-09-01   3rd Qu.:198.0  
##  Max.   :2018-05-01   Max.   :365.0
#
# #Assuming each row is one month with no duplicates
stopifnot(identical(anyDuplicated(bb$Dates), 0L))
#
# #Create Sequence of Months
#ii <- seq(ymd("2015-6-1"), ymd("2018-5-1"), by = "months")
ii <- tibble(Dates = seq(min(bb$Dates), max(bb$Dates), by = "months"))
#
diff_len <- nrow(ii) - nrow(bb)
#
if(!identical(diff_len, 0L)) {
  cat(paste0("Number of missing months = ", diff_len, "\n"))
  #
  # #Find Values that should be in Complete Sequence but are missing in the data
  as_date(setdiff(ii$Dates, bb$Dates))
  # #OR
  ii %>% anti_join(bb)
  #
  # #This does not need a separate Vector of all Months
  # #Get Months Difference using Integer Division and 
  # #Filter Rows which are not consecutive and rows above them
  bb %>% 
    mutate(diff_months = (interval(lag(Dates), Dates)) %/% months(1)) %>% 
    filter( (diff_months != 1) | lead(diff_months != 1) )
}
## Number of missing months = 3
## # A tibble: 2 x 3
##   Dates         D6 diff_months
##   <date>     <dbl>       <dbl>
## 1 2017-05-01    51           1
## 2 2017-09-01   111           4
# #Fill Missing Months 
jj <- as_tibble(merge(bb, ii, by = "Dates", all = TRUE)) 
kk <- right_join(bb, ii, by = "Dates") %>% arrange(Dates)
stopifnot(identical(jj, kk))
# #Replace NA 
ll <- kk %>% mutate(across(D6, coalesce, 0)) 

7.3.3 Q2

Test the claim that Among the app users for disease information, at least 15% of them access disease information related to disease 6. \(({\alpha} = 0.05)\)

31.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

  • Caution: a one-tail t-test is the wrong tool here; a Proportion Test is needed.
    • Wrong Analysis:
      • \({}^U\!P_{t} = 0.094 \to {}^U\!P_{(t)} > {\alpha} \to {H_0}\) cannot be rejected.
      • We cannot conclude that at least 15% of users access D6 related information.
    • (Aside) Why using t-test is wrong here
      • 15% is NOT the \({\mu}\); it is a proportion, so using it as the Mean was wrong.
      • We are NOT looking at the percentage of D6 lookups in each row or observation. It does not matter if on one day farmers looked only for other diseases and on another day searched only for D6; we are interested in the overall share of D6 searches among all searches for Diseases
  • Proportion Test

Determine whether the proportion of farmers searching for D6 is more than \(p_0 = 0.15\)

31.26 \(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)

  • Count of Success \(({x})\) is Searches of D6
  • \(\{n = 26830, x = 4295\} \to {\overline{p}} = {x}/{n} = 0.160082\)
  • (31.5)
    • \({}^U\!P_{({\chi}^2)} < {\alpha} \to {H_0}\) is rejected i.e. the proportions are different
    • We can conclude that the proportion of farmers searching for D6 is more than \(p_0 = 0.15\) for all searches for Diseases.
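To mirror the manual t-statistic computation used in Q1, the statistic behind prop.test() can also be computed by hand (counts taken from the bullet above; the continuity-corrected z satisfies X-squared = z²):

```r
# #Counts from above: x successes (D6 searches) out of n disease searches
x <- 4295; n <- 26830; p0 <- 0.15
p_bar <- x / n                                     # #0.160082
# #Upper-tail z with continuity correction, matching prop.test()
z <- (x - 0.5 - n * p0) / sqrt(n * p0 * (1 - p0))
c(X_squared = z^2, p_value = pnorm(z, lower.tail = FALSE))
```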

Proportion Test

# #Data | Sum Disease, Variety, Micronutrients 
aa <- xxJdata %>% 
  mutate(sumD = rowSums(across(starts_with("D"))), 
         sumV = rowSums(across(starts_with("V")))) %>% 
  rename(Dates = "Month-Year", Users = "No of users", Micro = "Micronutrient") %>% 
  mutate(SUM = rowSums(across(c(sumD, sumV, Micro))),
         DIFF = Usage - SUM) %>% 
  select(Dates, Users, Usage, SUM, DIFF, sumD, sumV, Micro, D6) %>% 
  mutate(across(Dates, as_date)) %>% 
  mutate(Fraction = D6/sumD)
#
# #Confirmed that Usage is Sum Total of Disease, Variety, Micronutrients
unique(aa$DIFF)
## [1] 0
#
# #Working Set | Exclude 1 NA i.e. where sumD is zero
bb <- aa %>% drop_na(Fraction) %>% select(Usage, sumD, D6, Fraction)
#
# #Check n (Sample Count) and x (Count of Success)
bb %>% summarise(across(c(sumD, D6), sum))
## # A tibble: 1 x 2
##    sumD    D6
##   <dbl> <dbl>
## 1 26830  4295
#
# #One Sample Proportion Test with continuity correction
bb_prop <- prop.test(x = sum(bb$D6), n = sum(bb$sumD), p = 0.15, 
                     alternative = "greater", conf.level = 0.95)
bb_prop
## 
##  1-sample proportions test with continuity correction
## 
## data:  sum(bb$D6) out of sum(bb$sumD), null probability 0.15
## X-squared = 21.311, df = 1, p-value = 0.000001953
## alternative hypothesis: true p is greater than 0.15
## 95 percent confidence interval:
##  0.1564156 1.0000000
## sample estimates:
##        p 
## 0.160082

Proportion Test (Usage)

# #Impact if we try to evaluate proportion of D6 searches out of ALL Usage (Disease, Variety, Micro)
# #Check n (Sample Count) and x (Count of Success)
bb %>% summarise(across(c(Usage, D6), sum))
## # A tibble: 1 x 2
##   Usage    D6
##   <dbl> <dbl>
## 1 71646  4295
#
# #One Sample Proportion Test with continuity correction
# #With p-value = 1, we cannot claim that 15% searches are for D6 only out of ALL Usage
prop.test(x = sum(bb$D6), n = sum(bb$Usage), p = 0.15, 
                     alternative = "greater", conf.level = 0.95)
## 
##  1-sample proportions test with continuity correction
## 
## data:  sum(bb$D6) out of sum(bb$Usage), null probability 0.15
## X-squared = 4556.2, df = 1, p-value = 1
## alternative hypothesis: true p is greater than 0.15
## 95 percent confidence interval:
##  0.05849838 1.00000000
## sample estimates:
##          p 
## 0.05994752

One Sample t-Test

# #One Sample t-Test (Wrong)
if(FALSE) {
  t.test(bb$Fraction, mu = 0.15, alternative = "greater", conf.level = 0.05)
}

7.3.4 Q3

Test the claim that the average number of users in year 2017-2018 is more than average number of users in year 2015-2016. \(({\alpha} = 0.05)\)

32.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)

  • If 1 is 2017-2018, 2 is 2015-2016 : Upper Tail Test is required
  • If 1 is 2015-2016, 2 is 2017-2018 : Lower Tail Test is required
  • Because the Variances are NOT the same, the Welch t-test is required; we cannot use the Pooled Test
  • \(({}^L\!P_{t = -7.255} = {}^U\!P_{t = 7.255}) < {\alpha} \to {H_0}\) is rejected
    • The sample results provide sufficient evidence to conclude that the average number of users has increased.
  • Also, Test the claim that app usage picked up after January 2016. (Moved from Q4)
    • \({}^U\!P_{t} < {\alpha} \to {H_0}\) is rejected
    • The sample results provide sufficient evidence to conclude that the app usage has increased after January 2016.
  • NOTE: Ideally, each missing month should have been added 4 times for the 4 weeks. “ForLater”
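A small sketch (simulated data with assumed group sizes and spreads, not the JAT numbers) of why the variance check matters: with unequal variances the Welch test uses a smaller, adjusted df than the pooled n1 + n2 - 2.

```r
set.seed(1)
g1 <- rnorm(73, mean = 50, sd = 40)    # #plays the 2015-2016 group
g2 <- rnorm(53, mean = 170, sd = 125)  # #plays the 2017-2018 group, larger variance
welch  <- t.test(g1, g2, alternative = "less")                   # #Welch is the default
pooled <- t.test(g1, g2, alternative = "less", var.equal = TRUE)
c(welch_df = unname(welch$parameter), pooled_df = unname(pooled$parameter))
# #pooled df is exactly 73 + 53 - 2 = 124; Welch df shrinks toward the noisier group
```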

Code

# #Data
bb <- xxJdata %>% 
  rename(Dates = "Month-Year", Users = "No of users") %>% 
  mutate(across(Dates, as_date)) %>% 
  select(Dates, Users, Usage)
#
# #Missing Months
ii <- tibble(Dates = seq(min(bb$Dates), max(bb$Dates), by = "months"))
jj <- right_join(bb, ii, by = "Dates") %>% arrange(Dates) %>% mutate(across(Users, coalesce, 0)) 
#
# #Create 2 Groups
jj$Year <- cut(jj$Dates, breaks = c(min(ii$Dates), as_date("2017-01-01"), Inf), 
               labels = c("2015-2016", "2017-2018"))
#
# #Verify Changes
jj[!duplicated(jj$Year), ]
## # A tibble: 2 x 4
##   Dates      Users Usage Year     
##   <date>     <dbl> <dbl> <fct>    
## 1 2015-06-01     2     4 2015-2016
## 2 2017-01-01    92   495 2017-2018
jj %>% filter(Dates %in% ymd(c("2016-12-01", "2017-01-01")))
## # A tibble: 8 x 4
##   Dates      Users Usage Year     
##   <date>     <dbl> <dbl> <fct>    
## 1 2016-12-01    50   536 2015-2016
## 2 2016-12-01    54   318 2015-2016
## 3 2016-12-01    99   558 2015-2016
## 4 2016-12-01   104   573 2015-2016
## 5 2017-01-01    92   495 2017-2018
## 6 2017-01-01   130   578 2017-2018
## 7 2017-01-01    87   436 2017-2018
## 8 2017-01-01    60   261 2017-2018
#
# #For Two Sample t-test, check if Variances are equal
jj_var <- var.test(Users ~ Year, data = jj)
jj_var
## 
##  F test to compare two variances
## 
## data:  Users by Year
## F = 0.10358, num df = 72, denom df = 52, p-value < 0.00000000000000022
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.06158444 0.17055359
## sample estimates:
## ratio of variances 
##          0.1035754
#
# #If Variances are Equal, Pooled Test otherwise Welch Test
isVarEqual <- jj_var$p.value > 0.05
#
# #Because 1 is "2015-2016", 2 is "2017-2018", we need to perform Lower Tail Test
jj_t <- t.test(formula = Users ~ Year, data = jj, alternative = "less", var.equal = isVarEqual)
jj_t
## 
##  Welch Two Sample t-test
## 
## data:  Users by Year
## t = -7.255, df = 59.87, p-value = 0.0000000004641
## alternative hypothesis: true difference in means between group 2015-2016 and group 2017-2018 is less than 0
## 95 percent confidence interval:
##       -Inf -92.96694
## sample estimates:
## mean in group 2015-2016 mean in group 2017-2018 
##                50.06849               170.84906
#
# #Alternatively, we can reverse Factor levels to perform Upper Tail Test
kk <- jj
kk$Year <- factor(kk$Year, levels = rev(levels(jj$Year)))
#
t.test(formula = Users ~ Year, data = kk, alternative = "greater", var.equal = isVarEqual)
## 
##  Welch Two Sample t-test
## 
## data:  Users by Year
## t = 7.255, df = 59.87, p-value = 0.0000000004641
## alternative hypothesis: true difference in means between group 2017-2018 and group 2015-2016 is greater than 0
## 95 percent confidence interval:
##  92.96694      Inf
## sample estimates:
## mean in group 2017-2018 mean in group 2015-2016 
##               170.84906                50.06849

Code (4b: Usage pickup after January 2016)

# #Data
str(jj)
## tibble [126 x 4] (S3: tbl_df/tbl/data.frame)
##  $ Dates: Date[1:126], format: "2015-06-01" "2015-07-01" "2015-07-01" ...
##  $ Users: num [1:126] 2 1 1 4 6 12 13 10 7 12 ...
##  $ Usage: num [1:126] 4 1 25 70 100 291 225 141 148 215 ...
##  $ Year : Factor w/ 2 levels "2015-2016","2017-2018": 1 1 1 1 1 1 1 1 1 1 ...
# #For Two Sample t-test, check if Variances are equal
jj_var <- var.test(Usage ~ Year, data = jj)
jj_var
## 
##  F test to compare two variances
## 
## data:  Usage by Year
## F = 0.8706, num df = 72, denom df = 49, p-value = 0.5853
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5117114 1.4434337
## sample estimates:
## ratio of variances 
##          0.8705962
#
# #If Variances are Equal, Pooled Test otherwise Welch Test
isVarEqual <- jj_var$p.value > 0.05
#
# #Because 1 is "2015-2016", 2 is "2017-2018", we need to perform Lower Tail Test
t.test(formula = Usage ~ Year, data = jj, alternative = "less", var.equal = isVarEqual)
## 
##  Two Sample t-test
## 
## data:  Usage by Year
## t = -4.7369, df = 121, p-value = 0.000002981
## alternative hypothesis: true difference in means between group 2015-2016 and group 2017-2018 is less than 0
## 95 percent confidence interval:
##     -Inf -252.37
## sample estimates:
## mean in group 2015-2016 mean in group 2017-2018 
##                424.6849                812.9000

7.3.5 Q4

Check whether app usage is same or different across the four weeks of a month. Test the claim that app usage picked up after January 2016. (Answered with Q3)

NOTE: The question as set is ‘check whether app usage is same or different across the four weeks of a month, using Jan-2016 - May-2018 data.’ However, as seen in figure 7.1, this period has 3 months of missing data and a completely different usage pattern afterwards, so I believe testing only this subset would give biased results. Therefore, this was not done.

35.2 \(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)

  • ANOVA is needed because more than two means are to be compared
    • Data groups are not normal and neither the variances are equal
    • Log transformation failed, Residual Test failed
      • NOTE: The data shown in class is the subset “Jan-2016 - May-2018”; the log transformation of that subset does become normal.
    • However, Sqrt transformation passed Normality Test as well as Variances are found to be same
  • Using Sqrt of Data \({}^U\!P_{(F)} > {\alpha} \to {H_0}\) cannot be rejected
    • The sample results do not provide sufficient evidence to conclude that the app usage is different across the four weeks of a month.

Question: When ANOVA is done on transformed data and a conclusion is reached, does this imply that the original data would also follow the same conclusion?

  • Look at the p-value of the transformed data for accepting or rejecting the Hypothesis, but look at the mean of the original data to apply those conclusions.
  • The ANOVA p-value is NOT trustworthy if the data is NOT Normal.

Question: When we are running any test, should we check whether the data is normal? Yes.

Images


Figure 7.2 (B15P04 B15P05) JAT: QQ Plot of Usage and Sqrt(Usage)

Anova & Kruskal

# #Data | Missing Months can be ignored because those are missing across all weeks
bb <- xxJdata %>% 
  rename(Dates = "Month-Year", Users = "No of users") %>% 
  select(Week, Usage)
#
str(bb)
## tibble [123 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Week : chr [1:123] "Week4" "Week1" "Week2" "Week3" ...
##  $ Usage: num [1:123] 4 1 25 70 100 291 225 141 148 215 ...
summary(bb)
##      Week               Usage       
##  Length:123         Min.   :   1.0  
##  Class :character   1st Qu.: 286.5  
##  Mode  :character   Median : 450.0  
##                     Mean   : 582.5  
##                     3rd Qu.: 749.5  
##                     Max.   :3462.0
#
# #ANOVA (on original data : neither normal, nor of equal variance)
bb_aov <- aov(formula = Usage ~ Week, data = bb)
#
# #
model.tables(bb_aov, type = "means")
## Tables of means
## Grand mean
##          
## 582.4959 
## 
##  Week 
##     Week1 Week2 Week3 Week4
##     551.9 522.2   480 764.7
## rep  31.0  30.0    30  32.0
#
# #Summary
summary(bb_aov)
##              Df   Sum Sq Mean Sq F value Pr(>F)  
## Week          3  1515178  505059    2.22 0.0894 .
## Residuals   119 27074553  227517                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
bb_aov
## Call:
##    aov(formula = Usage ~ Week, data = bb)
## 
## Terms:
##                     Week Residuals
## Sum of Squares   1515178  27074553
## Deg. of Freedom        3       119
## 
## Residual standard error: 476.9877
## Estimated effects may be unbalanced
#
kruskal.test(Usage ~ Week, data = bb)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Usage by Week
## Kruskal-Wallis chi-squared = 3.1614, df = 3, p-value = 0.3674
#
# #Poisson Test (ForLater)
#anova(glm(Usage ~ Week, data = ii, family = poisson), test = "LRT")
#
# #Transformation: Square Root Data
ii <- bb %>% mutate(Week = factor(Week)) %>% mutate(Sqrt = sqrt(Usage))
str(ii)
## tibble [123 x 3] (S3: tbl_df/tbl/data.frame)
##  $ Week : Factor w/ 4 levels "Week1","Week2",..: 4 1 2 3 4 1 2 3 4 1 ...
##  $ Usage: num [1:123] 4 1 25 70 100 291 225 141 148 215 ...
##  $ Sqrt : num [1:123] 2 1 5 8.37 10 ...
summary(ii)
##     Week        Usage             Sqrt      
##  Week1:31   Min.   :   1.0   Min.   : 1.00  
##  Week2:30   1st Qu.: 286.5   1st Qu.:16.93  
##  Week3:30   Median : 450.0   Median :21.21  
##  Week4:32   Mean   : 582.5   Mean   :22.35  
##             3rd Qu.: 749.5   3rd Qu.:27.38  
##             Max.   :3462.0   Max.   :58.84
#
# #ANOVA
ii_aov <- aov(formula = Sqrt ~ Week, data = ii)
# #
model.tables(ii_aov, type = "means")
## Tables of means
## Grand mean
##          
## 22.35259 
## 
##  Week 
##     Week1 Week2 Week3 Week4
##     21.98 21.57 20.74 24.95
## rep 31.00 30.00 30.00 32.00
#
# #Summary
summary(ii_aov)
##              Df Sum Sq Mean Sq F value Pr(>F)
## Week          3    317  105.75   1.274  0.286
## Residuals   119   9874   82.98
#
kruskal.test(Sqrt ~ Week, data = ii)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Sqrt by Week
## Kruskal-Wallis chi-squared = 3.1614, df = 3, p-value = 0.3674

Variance

(External) Variances in R

Statistical tests for comparing the variances of two or more samples. Equal variances across samples is called homogeneity of variances.

  • Reason
    • Two independent samples T-test and ANOVA test, assume that variances are equal across groups.
  • Statistical tests for comparing variances
    • F-test: Compare the variances of two samples. The data must be normally distributed.
    • Bartlett test: Compare the variances of k samples, where k can be more than two samples.
      • The data must be normally distributed.
      • The Levene test is an alternative to the Bartlett test that is less sensitive to departures from normality.
    • Levene test: Compare the variances of k samples, where k can be more than two samples.
      • It is an alternative to the Bartlett test that is less sensitive to departures from normality.
    • Fligner-Killeen test: a non-parametric test which is very robust against departures from normality.
  • Hypothesis
    • For all these tests (Bartlett test, Levene test or Fligner-Killeen test):
Definition 7.1 \(\text{\{Variances\}} {H_0} : {\sigma}_1 = {\sigma}_2 = \dots = {\sigma}_k \iff {H_a}: \text{At least two variances differ.}\)
  • On this data, all 3 tests have p-value less than 0.05, i.e. Variances are NOT same
  • On the Transformed Data (Sqrt), Levene Test and Fligner Test fail to detect difference in Variances
# #Data
str(bb)
## tibble [123 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Week : chr [1:123] "Week4" "Week1" "Week2" "Week3" ...
##  $ Usage: num [1:123] 4 1 25 70 100 291 225 141 148 215 ...
summary(bb)
##      Week               Usage       
##  Length:123         Min.   :   1.0  
##  Class :character   1st Qu.: 286.5  
##  Mode  :character   Median : 450.0  
##                     Mean   : 582.5  
##                     3rd Qu.: 749.5  
##                     Max.   :3462.0
#
# #Bartlett Test
bartlett.test(Usage ~ Week, data = bb)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  Usage by Week
## Bartlett's K-squared = 25.89, df = 3, p-value = 0.00001006
#
# #Levene Test
ii <- bb %>% mutate(Week = factor(Week))
leveneTest(Usage ~ Week, data = ii)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)  
## group   3  2.8691 0.0394 *
##       119                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# #Fligner-Killeen test
fligner.test(Usage ~ Week, data = bb)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  Usage by Week
## Fligner-Killeen:med chi-squared = 8.5854, df = 3, p-value = 0.03534
#
# #Transformation: Square Root Data
ii <- bb %>% mutate(Week = factor(Week)) %>% mutate(Sqrt = sqrt(Usage))
bartlett.test(Sqrt ~ Week, data = ii)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  Sqrt by Week
## Bartlett's K-squared = 10.656, df = 3, p-value = 0.01374
leveneTest(Sqrt ~ Week, data = ii)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   3  2.4529 0.06668 .
##       119                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fligner.test(Sqrt ~ Week, data = ii)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  Sqrt by Week
## Fligner-Killeen:med chi-squared = 6.6688, df = 3, p-value = 0.08324

Normality

# #Do the data from each of the 4 groups follow a normal distribution?
# #Shapiro-Wilk normality test
bb %>% mutate(Week = factor(Week)) %>% 
  group_by(Week) %>% 
  summarise(N = n(), Mean = mean(Usage), SD = sd(Usage),
            p_Shapiro = shapiro.test(Usage)$p.value, 
            isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 6
##   Week      N  Mean    SD p_Shapiro isNormal
##   <fct> <int> <dbl> <dbl>     <dbl> <lgl>   
## 1 Week1    31  552.  384. 0.00631   FALSE   
## 2 Week2    30  522.  353. 0.0173    FALSE   
## 3 Week3    30  480.  333. 0.000843  FALSE   
## 4 Week4    32  765.  714. 0.0000964 FALSE
# #Check Q-Q plot
#qqnorm(bb[bb$Week == "Week1", ]$Usage)
#
# #Transformation: Log (Did not pass Normality)
bb %>% mutate(Week = factor(Week)) %>% 
  mutate(Log = log(Usage)) %>% 
  group_by(Week) %>% 
  summarise(p_Shapiro = shapiro.test(Log)$p.value, 
            isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 3
##   Week    p_Shapiro isNormal
##   <fct>       <dbl> <lgl>   
## 1 Week1 0.000000533 FALSE   
## 2 Week2 0.0314      FALSE   
## 3 Week3 0.689       TRUE    
## 4 Week4 0.00125     FALSE
#
# #Transformation: Square Root (Success: Passed Normality) - Selected
bb %>% mutate(Week = factor(Week)) %>% 
  mutate(Sqrt = sqrt(Usage)) %>% 
  group_by(Week) %>% 
  summarise(p_Shapiro = shapiro.test(Sqrt)$p.value, 
            isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 3
##   Week  p_Shapiro isNormal
##   <fct>     <dbl> <lgl>   
## 1 Week1     0.493 TRUE    
## 2 Week2     0.967 TRUE    
## 3 Week3     0.108 TRUE    
## 4 Week4     0.732 TRUE
#
# #Transformation: Cube Root (Success: Passed Normality) Just to check
bb %>% mutate(Week = factor(Week)) %>% 
  mutate(CubeRoot = Usage^(1/3)) %>% 
  group_by(Week) %>% 
  summarise(p_Shapiro = shapiro.test(CubeRoot)$p.value, 
            isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 3
##   Week  p_Shapiro isNormal
##   <fct>     <dbl> <lgl>   
## 1 Week1    0.0979 TRUE    
## 2 Week2    0.980  TRUE    
## 3 Week3    0.359  TRUE    
## 4 Week4    0.997  TRUE
#
# #Testing Residuals i.e. Data - Group Mean (Did not pass Normality)
bb %>% mutate(Week = factor(Week)) %>% 
  group_by(Week) %>% 
  mutate(Residuals = Usage - mean(Usage)) %>% 
  summarise(p_Shapiro = shapiro.test(Residuals)$p.value, 
            isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 3
##   Week  p_Shapiro isNormal
##   <fct>     <dbl> <lgl>   
## 1 Week1 0.00631   FALSE   
## 2 Week2 0.0173    FALSE   
## 3 Week3 0.000843  FALSE   
## 4 Week4 0.0000964 FALSE

QQ Plot

bb <- xxJdata %>% 
  rename(Dates = "Month-Year", Users = "No of users") %>% 
  mutate(Week = factor(Week)) %>% 
  mutate(Sqrt = sqrt(Usage)) %>% 
  select(Week, Usage, Sqrt)
#
hh <- bb
ttl_hh <- "QQ Plot of Usage"
cap_hh <- "B15P04"
#
B15 <- hh %>% { ggplot(., aes(sample = Usage, colour = Week)) +
    stat_qq() +
    stat_qq_line() +
    labs(caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B15)
rm(B15)
#
ttl_hh <- "QQ Plot of Sqrt(Usage)"
cap_hh <- "B15P05"
B15 <- hh %>% { ggplot(., aes(sample = Sqrt, colour = Week)) +
    stat_qq() +
    stat_qq_line() +
    labs(caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B15)
rm(B15)

7.3.6 Q5

A new version of the app was released in August-2016. In which month in the given time frame after the launch of the new version does the mean usage pattern start to show a statistically significant shift?

32.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)

  • (1: OldApp, 2: NewApp) Lower Tail Test is required
  • Because the Variances are NOT the same, the Welch t-test is required; we cannot use the Pooled Test
  • \({}^L\!P_{t} < {\alpha} \to {H_0}\) is rejected
    • The sample results provide sufficient evidence to conclude that the mean usage has increased after August-2016.

Basic

# #Data
bb <- xxJdata %>% rename(Dates = "Month-Year") %>% mutate(across(Dates, as_date)) %>% 
  select(Dates, Week, Usage)
#
# #Create 2 Groups
bb$Year <- cut(bb$Dates, breaks = c(min(bb$Dates), as_date("2016-08-01"), Inf), 
               labels = c("OldApp", "NewApp"))
#
# #For Two Sample t-test, check if Variances are equal
bb_var <- var.test(Usage ~ Year, data = bb)
bb_var
## 
##  F test to compare two variances
## 
## data:  Usage by Year
## F = 0.16799, num df = 52, denom df = 69, p-value = 0.0000000004775
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.1014182 0.2835174
## sample estimates:
## ratio of variances 
##          0.1679867
#
# #If Variances are Equal, Pooled Test otherwise Welch Test
isVarEqual <- bb_var$p.value > 0.05
#
# #Because 1 is "OldApp", 2 is "NewApp", we need to perform Lower Tail Test
bb_t <- t.test(formula = Usage ~ Year, data = bb, alternative = "less", var.equal = isVarEqual)
bb_t
## 
##  Welch Two Sample t-test
## 
## data:  Usage by Year
## t = -6.0938, df = 96.698, p-value = 0.00000001124
## alternative hypothesis: true difference in means between group OldApp and group NewApp is less than 0
## 95 percent confidence interval:
##       -Inf -317.4977
## sample estimates:
## mean in group OldApp mean in group NewApp 
##             334.1132             770.5571
#

rollapply()

# #Rolling Sums
old_sd <- sd(bb[bb$Year == "OldApp", ]$Usage)
old_n <- summary(bb$Year)[1]
#
ii <- bb %>% filter(Year == "NewApp", Dates <= '2017-05-01') %>% select(-Year)
#
jj <- ii %>% mutate(ID = row_number(), cSUM = cumsum(Usage), cMean = cSUM/ID) %>% 
  mutate(SD = across(Usage, ~ rollapply(., ID, sd, fill = NA, align = "right"))) %>% 
  mutate(DOF = floor({SD^2 / ID + old_sd^2 / old_n }^2 / 
                       {{SD^2 / ID}^2/{ID-1} + {old_sd^2 / old_n}^2/{old_n-1}})) %>% 
  mutate(Sigma = sqrt({SD^2 /ID + old_sd^2 /old_n}))
str(jj)
## tibble [40 x 9] (S3: tbl_df/tbl/data.frame)
##  $ Dates: Date[1:40], format: "2016-08-01" "2016-08-01" "2016-08-01" ...
##  $ Week : chr [1:40] "Week1" "Week2" "Week3" "Week4" ...
##  $ Usage: num [1:40] 421 387 264 788 691 256 261 377 295 749 ...
##  $ ID   : int [1:40] 1 2 3 4 5 6 7 8 9 10 ...
##  $ cSUM : num [1:40] 421 808 1072 1860 2551 ...
##  $ cMean: num [1:40] 421 404 357 465 510 ...
##  $ SD   : tibble [40 x 1] (S3: tbl_df/tbl/data.frame)
##   ..$ Usage: num [1:40] NA 24 82.6 225.6 220 ...
##  $ DOF  :'data.frame':   40 obs. of  1 variable:
##   ..$ Usage: num [1:40] NA 14 3 3 4 6 7 9 11 13 ...
##  $ Sigma:'data.frame':   40 obs. of  1 variable:
##   ..$ Usage: num [1:40] NA 34.9 56.6 116.9 103 ...
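The DOF column computed above is the Welch–Satterthwaite approximation for the degrees of freedom (the code then applies floor()), with \(s_1 = \) SD and \(n_1 = \) ID for the NewApp window, and \(s_2 = \) old_sd and \(n_2 = \) old_n for OldApp:

```latex
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
                 {\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
```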

7.3.7 Q6

If a disease is likely to spread in particular weather condition (data given in the disease index sheet), then the access of that disease should be more in the months having suitable weather conditions. Help the analyst in coming up with a statistical test to support the claim for two districts for which the sample of weather and disease access data is provided in the data sheet. Identify the diseases for which you can support this claim. Test this claim both for temperature and relative humidity at 95% confidence.

32.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)

Table 7.1: (B15T01) Q6: Diseases p-value (Upper)
p- Var Equal Var p- t-test Upper H0
D1 ~ iD1 0.00002 FALSE 0.00084 Rejected
D2 ~ iD2 0.00009 FALSE 0.00076 Rejected
D3 ~ iD3 0.0002 FALSE 0.00399 Rejected
D4 ~ iD4 0.00029 FALSE 0.06659 Not Rejected
D5 ~ iD5 0.37691 TRUE 0.01814 Rejected
D7 ~ iD7 0.90392 TRUE 0.00239 Rejected
  • Because group 1 is conditions favourable for the disease (TRUE) and group 2 is FALSE, an Upper Tail Test is required
  • Testing variance and applying Welch t-test (variances not same) or Pooled Test (variances are same)
  • Assuming both districts to be part of same sample
  • Values are given in Table 7.1
    • \({}^U\!P_{t} > {\alpha} \to {H_0}\) cannot be rejected for disease 4
    • \({}^U\!P_{t} < {\alpha} \to {H_0}\) is rejected for diseases 1, 2, 3, 5, and 7
    • The sample results provide sufficient evidence to conclude that the claim is True for diseases 1, 2, 3, 5, and 7
    • An Upper Tail Test was performed instead of a Two Tail Test because it is assumed that searches for a disease would not decrease during the conditions favourable to it.
    • Both districts were combined with the thinking that political boundaries may not change the behaviour of people based on geographical factors (T & RH).
      • However, a case can be made when we are interested in observing difference in pattern of behaviour by the people of different districts. In that case, we would be performing a difference of means test between the two districts.

BoxPlot

(B15P03) JAT: Disease Searches grouped by Favourable and Unfavourable Conditions (T and RH)

Figure 7.3 (B15P03) JAT: Disease Searches grouped by Favourable and Unfavourable Conditions (T and RH)

Code

# #Merge both dataframes of Two districts
aa <- bind_rows(Belagavi = xxJbela, Dharwad = xxJdhar, .id = 'source') %>% 
  rename(Dates = Months, RH = "Relative Humidity", TMP = "Temperature") %>% 
  mutate(across(Dates, as_date)) %>% mutate(source = factor(source)) %>% 
  select(-c(10:13)) %>% select(-D6)
#
# #Based on Conditional T & RH, get each disease favourable condition = TRUE
q6_bb <- aa %>% mutate(iD1 = ifelse(TMP <= 24 & TMP >= 20 & RH > 80, TRUE, FALSE), 
                    iD2 = ifelse(TMP <= 24.5 & TMP >= 21.5 & RH > 83, TRUE, FALSE), 
                    iD3 = ifelse(TMP <= 24 & TMP >= 22, TRUE, FALSE), 
                    iD4 = ifelse(TMP <= 26 & TMP >= 22 & RH > 85, TRUE, FALSE), 
                    iD5 = ifelse(TMP <= 24.5 & TMP >= 22 & RH <= 85 & RH >= 77, TRUE, FALSE), 
                    iD7 = ifelse(TMP > 25 & RH > 80, TRUE, FALSE)) %>% 
  mutate(across(starts_with("i"), factor, levels = c(TRUE, FALSE)))
bb <- q6_bb
#
# #Create all Formulae for variance and t-test
formulas <- paste0(names(bb)[3:8], " ~ ", names(bb)[11:16])
#
# #Apply formulae
output <- t(sapply(formulas, function(f) {
    test_var <- var.test(as.formula(f), data = bb)
    isVarEqual <- ifelse(test_var$p.value > 0.05, TRUE, FALSE) 
    test_t <- t.test(as.formula(f), data = bb, alternative = "greater", var.equal = isVarEqual)
    c("p- Var" = format(round(test_var$p.value, 5), scientific = FALSE), 
      "Equal Var" = ifelse(test_var$p.value > 0.05, TRUE, FALSE), 
      "p- t-test" = format(round(test_t$p.value, 5), scientific = FALSE), 
      "Upper H0" = ifelse(test_t$p.value > 0.05, "Not Rejected", "Rejected"))
}))

BoxPlot More

bb <- q6_bb
hh <- q6_bb %>% 
  rename_with(~gsub("iD", "i", .x)) %>% 
  select(starts_with(c("D", "i"))) %>% 
  select(-Dates) %>% 
  pivot_longer(everything(), names_to = c(".value", "Disease"), names_pattern = "(.)(.)") %>%
  rename(Values = "D", Favourable = "i")
#
ttl_hh <- "BoxPlot of Searches for Diseases in both districts"
cap_hh <- "B15P03"
#
B15 <- hh %>% { ggplot(data = ., mapping = aes(x = Disease, y = Values, fill = Favourable)) +
        geom_boxplot(outlier.shape = NA) +
        #stat_summary(fun = mean, geom = "point", size = 2, color = "steelblue") + 
        #scale_y_continuous(breaks = seq(0, 110, 10), limits = c(0, 110)) +
        geom_point(position = position_jitterdodge(jitter.width = 0.1), 
                   size = 1, alpha = 0.7, colour = "#21908CFF") + 
        k_gglayer_box +
        theme(
            #legend.justification = c("right", "top"),
            #legend.box.just = "right",
            #legend.margin = margin(6, 6, 6, 6),
            legend.position = c(.90, .95)
        ) +
        labs(x = "Diseases", y = "Searches per month", fill = "Favourable",
             caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B15)
rm(B15)

Pattern Match

# #rename_with() uses formula
# #Selection Helpers like starts_with() can take multiple conditions
# #pivot_longer() can return multiple groups
# #Pattern Match: Both First and Second Pattern can contain only 1 character
ii <- q6_bb %>% 
  rename_with(~gsub("iD", "i", .x)) %>% 
  select(starts_with(c("D", "i"))) %>% select(-Dates) %>% 
  pivot_longer(everything(), names_to = c(".value", "Disease"), names_pattern = "(.)(.)") %>%
  rename(Values = "D", Favourable = "i")
#
# #First Pattern can contain 1 or more characters but the Second can have only 1 character
jj <- q6_bb %>% 
  select(starts_with(c("D", "i"))) %>% select(-Dates) %>% 
  pivot_longer(everything(), names_to = c(".value", "Disease"), names_pattern = "(.*)(.)") %>% 
  rename(Values = "D", Favourable = iD)
stopifnot(identical(ii, jj))

Validation


8 Data Preprocessing (B16, Oct-24)

8.1 Overview

8.2 Packages

if(FALSE){# #WARNING: Installation may take some time.
  install.packages("mice", dependencies = TRUE)
  install.packages("car", dependencies = TRUE)
}

8.3 Data

Please import the "B16-Cars2.csv"

Cars

Table 8.1: (B16T01) Cars Data (Head)
sl.No. mpg cylinders cubicinches hp weightlbs time.to.60 year brand
1 14.0 8 350 165 4209 12 1972 US
2 31.9 4 89 71 1925 14 1980 Europe
3 17.0 8 302 140 3449 11 1971 US
4 15.0 8 400 150 3761 10 1971 US
5 30.5 4 98 63 2051 17 1978 US
6 23.0 8 350 125 3900 17 1980 US

Structure

# #Structure
str(xxB16Cars)
## spec_tbl_df [263 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ sl.No.     : num [1:263] 1 2 3 4 5 6 7 8 9 10 ...
##  $ mpg        : num [1:263] 14 31.9 17 15 30.5 23 13 14 25.4 37.7 ...
##  $ cylinders  : num [1:263] 8 4 8 8 4 8 8 8 5 4 ...
##  $ cubicinches: num [1:263] 350 89 302 400 98 350 351 440 183 89 ...
##  $ hp         : num [1:263] 165 71 140 150 63 125 158 215 77 62 ...
##  $ weightlbs  : num [1:263] 4209 1925 3449 3761 2051 ...
##  $ time.to.60 : num [1:263] 12 14 11 10 17 17 13 9 20 17 ...
##  $ year       : num [1:263] 1972 1980 1971 1971 1978 ...
##  $ brand      : chr [1:263] "US" "Europe" "US" "US" ...

Summary

# #Summary
summary(xxB16Cars)
##      sl.No.           mpg           cylinders      cubicinches          hp          weightlbs     
##  Min.   :  1.0   Min.   : 10.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   : 192.5  
##  1st Qu.: 66.5   1st Qu.: 16.95   1st Qu.:4.000   1st Qu.:103.0   1st Qu.: 75.5   1st Qu.:2245.5  
##  Median :132.0   Median : 22.00   Median :6.000   Median :156.0   Median : 95.0   Median :2830.0  
##  Mean   :132.0   Mean   : 25.07   Mean   :5.593   Mean   :201.5   Mean   :106.3   Mean   :2992.9  
##  3rd Qu.:197.5   3rd Qu.: 28.90   3rd Qu.:8.000   3rd Qu.:302.0   3rd Qu.:137.5   3rd Qu.:3654.5  
##  Max.   :263.0   Max.   :527.00   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :4997.0  
##    time.to.60         year         brand          
##  Min.   : 8.00   Min.   :1971   Length:263        
##  1st Qu.:14.00   1st Qu.:1974   Class :character  
##  Median :16.00   Median :1977   Mode  :character  
##  Mean   :15.54   Mean   :1977                     
##  3rd Qu.:17.00   3rd Qu.:1980                     
##  Max.   :25.00   Max.   :1983

8.4 Missing Data

8.5 Imputation

Definition 8.1 Imputation is the process of replacing missing data with substituted values. Imputation preserves all cases by replacing missing data with an estimated value based on other available information.
  • Imputation
    • What would be the most likely value for this missing value, given all the other attributes for a particular record
    • “Multivariate Imputation by Chained Equations” (MICE) is used for Imputation.
  • A common method of “handling” missing values is simply to omit the records or fields with missing values from the analysis. However, there are issues with this approach.
    • Sometimes it is not feasible or desirable to delete all the rows containing missing values
    • Sometimes the data is deliberately missing
      • Example: Customers with high value transactions have higher missing income data
        • This is also a pattern and we cannot simply remove these customers from our analysis
    • Question: What if there are only a small number of missing values in a large dataset
      • Even then, we need to look at whether the data is meaningful or not. Some data points are so critical that they cannot be removed. If you delete them, you might be losing highly relevant information.
    • Question: Only 15 rows in a 1-lakh dataset
      • Do the rows belong to a new product which is relevant to our analysis
      • While these rows may not have mileage information of a car, other columns would have pricing and other critical information. Deleting these rows may result in loss of this important data.
      • It would be a waste to omit the information in all the other fields, just because one field value is missing.
  • Mean value replacement
    • Refer Mean
    • Replace the missing value by ‘mean’ of the data
    • However, the ‘mean’ is affected by presence of extreme values (outliers)
  • Median replacement
    • Refer Median
    • Although the mean is the more commonly used measure of central location, whenever a data set contains extreme values, the median is preferred.
    • The median is generally a safer replacement measure than the mean because it is robust to outliers
  • Mode replacement
    • Refer Mode
    • For Categorical variable, mode is preferred.
    • However, using mode for NA replacement increases the frequency of the most frequent item.
  • Random value replacement
    • Replace NA with a random value from the observed distribution of the variable.
    • However, the resulting observation (row) might not make sense in terms of grouping of variables (columns).
  • Problems:
    • These simple approaches usually introduce bias into the data
      • Ex: Applying mean substitution leaves the mean unchanged (desirable) but decreases variance (undesirable). The resulting confidence levels for statistical inference will be overoptimistic, as measures of spread will be artificially reduced.
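The random value replacement idea above can be sketched in a few lines of base R (illustrative only; the vector x is made up, not taken from the Cars data):

```r
# Random-value replacement (sketch): fill each NA with a draw from the
# observed, non-missing values of the same variable
set.seed(3)                      # for reproducibility
x <- c(14, 31.9, NA, 15, 30.5, NA)
obs <- x[!is.na(x)]              # observed distribution of the variable
x[is.na(x)] <- sample(obs, sum(is.na(x)), replace = TRUE)
x                                # no NA left; imputed values come from obs
```

As noted above, a row imputed this way can end up combining values that never co-occur in the real data.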

8.5.1 Introduce NA

aa <- xxB16Cars #No missing value
bb <- aa     #Will have missing value later
#
# #Identify the Number of Missing Values
if(anyNA(bb)) {
  cat(paste0("NA are Present! Total NA = ", sum(is.na(bb)), "\n")) 
  } else cat(paste0("NA not found.\n"))
## NA not found.
#
# #Record Some Values, before deleting them
bb_22 <- bb$mpg[2]        #bb[2, 2] #31.9
bb_39 <- bb$brand[3]      #bb[3, 9] #"US"
bb_43 <- bb$cylinders[4]  #bb[4, 3] #8
#
# #Delete 
bb$mpg[2] <- bb$brand[3] <- bb$cylinders[4] <- NA
#
# #Identify the Number of Missing Values
cat(paste0("NA are Present! Total NA = ", sum(is.na(bb)), "\n")) 
## NA are Present! Total NA = 3
#
# #Which Columns have NA
#summary(bb)
bb_na_col <- colSums(is.na(bb))
#
# #Column Names with their Column Index
which(bb_na_col != 0)
##       mpg cylinders     brand 
##         2         3         9
#
# #Number of NA in each Column
bb_na_col[which(bb_na_col != 0)]
##       mpg cylinders     brand 
##         1         1         1
#
# #How many rows contain NA
sum(!complete.cases(bb))
## [1] 3
#
# #Indices of Rows with NA
head(which(!complete.cases(bb)))
## [1] 2 3 4
#

8.5.2 summary() NA

  • Caution:
    • summary() does not show NA in a character column but does show them for a factor
    • table() does not show NA by default but can show them via ‘useNA’
ii <- bb
summary(ii$brand)
##    Length     Class      Mode 
##       263 character character
summary(factor(ii$brand))
## Europe  Japan     US   NA's 
##     48     51    163      1
#
# #table() by default does not show NA even in factor. However it has 'useNA' option
table(ii$brand)
## 
## Europe  Japan     US 
##     48     51    163
table(factor(ii$brand))
## 
## Europe  Japan     US 
##     48     51    163
table(ii$brand, useNA = "always")
## 
## Europe  Japan     US   <NA> 
##     48     51    163      1

8.5.3 Mean and Median Replacement

  • ‘na.rm = TRUE’: all NA values are ignored in the calculation.
ii <- bb
# #Mean Replacement
#ii %>% mutate(across(mpg, ~ replace(., is.na(.), round(mean(mpg, na.rm = TRUE), 2))))
ii$mpg[which(is.na(ii$mpg))] <- round(mean(ii$mpg, na.rm = TRUE), digits = 2)
jj <- ii
#
# #Median Replacement
ii <- bb
#ii %>% mutate(across(mpg, replace_na, round(median(mpg, na.rm = TRUE), 2)))
ii$mpg[which(is.na(ii$mpg))] <- round(median(ii$mpg, na.rm = TRUE), digits = 2)
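A minimal base-R illustration of the ‘na.rm’ behaviour used above (the vector is made up):

```r
# Without na.rm, any NA propagates to the result; with na.rm = TRUE the
# NA values are dropped before computing
x <- c(10, 20, NA, 40)
mean(x)                  # NA
mean(x, na.rm = TRUE)    # mean of 10, 20, 40
median(x, na.rm = TRUE)  # 20
```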

8.5.4 Mode Replacement

table(bb$brand, useNA = "always")
## 
## Europe  Japan     US   <NA> 
##     48     51    163      1
bb %>% group_by(brand) %>% summarise(n()) 
## # A tibble: 4 x 2
##   brand  `n()`
##   <chr>  <int>
## 1 Europe    48
## 2 Japan     51
## 3 US       163
## 4 <NA>       1
#
# #Mode Replacement
#ii$brand[which(is.na(ii$brand))] <- f_getMode(ii$brand)
ii$brand[which(is.na(ii$brand))] <- "US"
#
# #Caution: Do not use max() on "character" for mode replacement
# #It will only look for ASCII value of letters
ii <- c("a", "z", "c", "b", "b", "b", "USA", NA, "a")
max(ii, na.rm = TRUE) #Wrong Value
## [1] "z"
f_getMode(ii) 
## [1] "b"

8.5.5 md.pattern()

  • mice::md.pattern()
    • It gives the number of NA in each column and the pattern of rows with missing values
    • This is useful when there is correlation in the missing values of two or more columns
# #Convert to Factor before using MICE
bb$brand <- factor(bb$brand)
#
# #mice::md.pattern() 
na_bb <- md.pattern(bb, plot = FALSE)
na_bb
##     sl.No. cubicinches hp weightlbs time.to.60 year mpg cylinders brand  
## 260      1           1  1         1          1    1   1         1     1 0
## 1        1           1  1         1          1    1   1         1     0 1
## 1        1           1  1         1          1    1   1         0     1 1
## 1        1           1  1         1          1    1   0         1     1 1
##          0           0  0         0          0    0   1         1     1 3
(B16P01) Cars: Inserted Missing Value Pattern by md.pattern()

Figure 8.1 (B16P01) Cars: Inserted Missing Value Pattern by md.pattern()

8.5.6 Seed

  • set.seed()
    • Random Number Generation can be fixed by a Seed for Reproducibility or Replication
    • The number given as seed ‘3’ is not meaningful. It can be anything.
    • However, it is recommended to use same number as seed throughout the calculations to avoid perception of fixing the values.
  • Question: Is it for learning purpose and not for real world data
    • No, it is required for reproducibility
# #Choose Two Numbers from 1:10, Randomly
sample(1:10, 2)
## [1] 7 2
sample(1:10, 2)
## [1] 7 4
sample(1:10, 2)
## [1]  3 10
# #All above calls to generate Two random numbers produce different outcomes
# #Using set.seed() we can regenerate the same random numbers every time
set.seed(3)
sample(1:10, 2) 
## [1] 5 7
sample(1:10, 2)
## [1] 4 8
#
# #If we re-fix the seed, the counter works along same pathway and re-generate numbers
set.seed(3)
sample(1:10, 2)
## [1] 5 7
sample(1:10, 2)
## [1] 4 8

8.5.7 mice()

MICE

  • package:mice is used for imputation
    • “Multivariate Imputation by Chained Equations”
      • It can impute a factor variable using numeric variables (and vice versa)
    • Question: What happens if any categorical variable is associated with multiple numeric values. For example, if Car Honda has multiple mileage values
      • It will look at all other columns of data and based on these multiple columns identify a pattern which will be used for imputation.
    • Advantage:
      • During the mean replacement, we are using only one column for imputation. MICE is more robust because it looks for pattern in multiple columns
    • m = 2 is the number of imputed sets. It shows up as 1 and 2 in the “imp” column of the output. It does not mean that only 2 columns have NA.
    • ‘iter’ from 1 to 5 is the number of iterations
    • Caution: Check that the column is a “factor.” A “character” column will not be imputed and will retain its NA.
    • Question: For a categorical variable does it always give ‘mode’
      • NO, it looks at the pattern based on other variables
# #Convert to Factor before using MICE
bb$brand <- factor(bb$brand)

# #mice() for imputation
# #Including all relevant data i.e. skipping Serial Number only
impute <- mice(bb[ , 2:9], m = 2, seed = 3)
## 
##  iter imp variable
##   1   1  mpg  cylinders  brand
##   1   2  mpg  cylinders  brand
##   2   1  mpg  cylinders  brand
##   2   2  mpg  cylinders  brand
##   3   1  mpg  cylinders  brand
##   3   2  mpg  cylinders  brand
##   4   1  mpg  cylinders  brand
##   4   2  mpg  cylinders  brand
##   5   1  mpg  cylinders  brand
##   5   2  mpg  cylinders  brand
#
print(impute)
## Class: mids
## Number of multiple imputations:  2 
## Imputation methods:
##         mpg   cylinders cubicinches          hp   weightlbs  time.to.60        year       brand 
##       "pmm"       "pmm"          ""          ""          ""          ""          ""   "polyreg" 
## PredictorMatrix:
##             mpg cylinders cubicinches hp weightlbs time.to.60 year brand
## mpg           0         1           1  1         1          1    1     1
## cylinders     1         0           1  1         1          1    1     1
## cubicinches   1         1           0  1         1          1    1     1
## hp            1         1           1  0         1          1    1     1
## weightlbs     1         1           1  1         0          1    1     1
## time.to.60    1         1           1  1         1          0    1     1
#
# #For each iteration we have a different set of imputed data 
# #e.g. for 'mpg' in two sets values are
impute$imp$mpg
##    1  2
## 2 28 14
#
# #NOTE: Original Values that were removed earlier 
tibble(mpg = bb_22, brand = bb_39, cylinders = bb_43)
## # A tibble: 1 x 3
##     mpg brand cylinders
##   <dbl> <chr>     <dbl>
## 1  31.9 US            8
#
# #Complete First Set
set1_bb <- complete(impute, 1)
tibble(mpg = set1_bb$mpg[2], brand = set1_bb$brand[3], cylinders = set1_bb$cylinders[4])
## # A tibble: 1 x 3
##     mpg brand cylinders
##   <dbl> <fct>     <dbl>
## 1    28 US            8
#
# #Complete Second Set
set2_bb <- complete(impute, 2)
tibble(mpg = set2_bb$mpg[2], brand = set2_bb$brand[3], cylinders = set2_bb$cylinders[4])
## # A tibble: 1 x 3
##     mpg brand cylinders
##   <dbl> <fct>     <dbl>
## 1    14 US            8

MICE More

set.seed(3)
ii <- mice(bb[ , 2:9], m = 3)
## 
##  iter imp variable
##   1   1  mpg  cylinders  brand
##   1   2  mpg  cylinders  brand
##   1   3  mpg  cylinders  brand
##   2   1  mpg  cylinders  brand
##   2   2  mpg  cylinders  brand
##   2   3  mpg  cylinders  brand
##   3   1  mpg  cylinders  brand
##   3   2  mpg  cylinders  brand
##   3   3  mpg  cylinders  brand
##   4   1  mpg  cylinders  brand
##   4   2  mpg  cylinders  brand
##   4   3  mpg  cylinders  brand
##   5   1  mpg  cylinders  brand
##   5   2  mpg  cylinders  brand
##   5   3  mpg  cylinders  brand
set.seed(3)
jj <- mice(bb[ , 2:9], m = 3)
## 
##  iter imp variable
##   1   1  mpg  cylinders  brand
##   1   2  mpg  cylinders  brand
##   1   3  mpg  cylinders  brand
##   2   1  mpg  cylinders  brand
##   2   2  mpg  cylinders  brand
##   2   3  mpg  cylinders  brand
##   3   1  mpg  cylinders  brand
##   3   2  mpg  cylinders  brand
##   3   3  mpg  cylinders  brand
##   4   1  mpg  cylinders  brand
##   4   2  mpg  cylinders  brand
##   4   3  mpg  cylinders  brand
##   5   1  mpg  cylinders  brand
##   5   2  mpg  cylinders  brand
##   5   3  mpg  cylinders  brand
#
# #identical() is FALSE but all.equal() is TRUE
identical(ii, jj)
## [1] FALSE
all.equal(ii, jj)
## [1] TRUE
#
# #Similarly
ii <- mice(bb[ , 2:9], m = 3, seed = 3)
## 
##  iter imp variable
##   1   1  mpg  cylinders  brand
##   1   2  mpg  cylinders  brand
##   1   3  mpg  cylinders  brand
##   2   1  mpg  cylinders  brand
##   2   2  mpg  cylinders  brand
##   2   3  mpg  cylinders  brand
##   3   1  mpg  cylinders  brand
##   3   2  mpg  cylinders  brand
##   3   3  mpg  cylinders  brand
##   4   1  mpg  cylinders  brand
##   4   2  mpg  cylinders  brand
##   4   3  mpg  cylinders  brand
##   5   1  mpg  cylinders  brand
##   5   2  mpg  cylinders  brand
##   5   3  mpg  cylinders  brand
jj <- mice(bb[ , 2:9], m = 3, seed = 3)
## 
##  iter imp variable
##   1   1  mpg  cylinders  brand
##   1   2  mpg  cylinders  brand
##   1   3  mpg  cylinders  brand
##   2   1  mpg  cylinders  brand
##   2   2  mpg  cylinders  brand
##   2   3  mpg  cylinders  brand
##   3   1  mpg  cylinders  brand
##   3   2  mpg  cylinders  brand
##   3   3  mpg  cylinders  brand
##   4   1  mpg  cylinders  brand
##   4   2  mpg  cylinders  brand
##   4   3  mpg  cylinders  brand
##   5   1  mpg  cylinders  brand
##   5   2  mpg  cylinders  brand
##   5   3  mpg  cylinders  brand
identical(ii, jj)
## [1] FALSE
all.equal(ii, jj)
## [1] TRUE
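The identical()/all.equal() contrast above is general, not mice-specific: identical() demands exact equality of every component, while all.equal() tests for near-equality. A minimal base-R illustration:

```r
# identical() requires exact (bitwise) equality; all.equal() allows a
# small numeric tolerance
a <- 0.1 + 0.2
b <- 0.3
identical(a, b)          # FALSE: the binary doubles differ in the last bits
isTRUE(all.equal(a, b))  # TRUE: the difference is within tolerance
```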

Warning Logged Event

(External) MICE Package Author

  • The loggedEvents component of the ‘mids’ object is a data frame with five columns.
    • ‘it’ ‘im’ stand for iteration and imputation number
    • ‘dep’ contains the name of the target variable, and is left blank at initialization
    • ‘meth’ signals the type of problem
      • ‘constant’ : oversized representation of a single value
        • This also comes up if the column is “character” and not converted into “factor”
      • ‘collinear’: The column is duplicate of another column
      • ‘pmm’ : “ForLater” Unknown for now
    • ‘out’ contains the names of the removed variables.
    • In general, strive for zero entries, in which case the loggedEvent component is equal to NULL.
  • Guidance
    • Inspect all complete variables for forgotten missing data marks. Repair or remove these variables. Even one forgotten mark may ruin the imputation model. Remove outliers with improbable values.
    • Obtain insight into the strong and weak parts of the data by studying the influx-outflux pattern. Unless they are scientifically important, remove variables with low outflux, or with high fractions of missing data.
    • Perform a dry run with maxit=0 and inspect the logged events produced by mice(). Remove any constant and collinear variables before imputation.
    • Find out what will happen after the data have been imputed. Determine a set of variables that are important in subsequent analyses, and include these as predictors in all models. Transform variables to improve predictability and coherence in the complete-data model.
    • Run quickpred(), and determine values of mincor and minpuc such that the average number of predictors is around 25.
    • After imputation, determine whether the generated imputations are sensible by comparing them to the observed information, and to knowledge external to the data. Revise the model where needed.
    • Document your actions and decisions, and obtain feedback from the owner of the data.
# #Using the "character" to generate the Warning
ii <- aa
ii$mpg[2] <- ii$brand[3] <- ii$cylinders[4] <- NA
#
tryCatch(expr = {
  jj <- mice(ii[ , 2:9], m = 1, seed = 3)
  }, warning = function(w) {
    print(paste0(w))
  })
## 
##  iter imp variable
##   1   1  mpg  cylinders
##   2   1  mpg  cylinders
##   3   1  mpg  cylinders
##   4   1  mpg  cylinders
##   5   1  mpg  cylinders
## [1] "simpleWarning: Number of logged events: 1\n"
#
# #Warning message: Number of logged events
# #It can occur because of a variety of issues in the data
jj$loggedEvents
## NULL

8.6 Outliers

Refer Outliers: C03 and Outliers: B12

  • How do we detect and deal with outliers
    • Use Visualisations for detecting outliers

8.6.1 Histogram

Image

(B16P02 B16P03) Cars: Histogram and Density of Weight (lbs)

Figure 8.2 (B16P02 B16P03) Cars: Histogram and Density of Weight (lbs)

hist()

# Set up the plot area to visualise 3 plots simultaneously
par(mfrow = c(1, 3))
# Create the histogram bars
hist(aa$weightlbs,
     breaks = 30,
     xlim = c(0, 5000),
     col = "blue",
     border = "black",
     ylim = c(0, 40),
     xlab = "Weight",
     ylab = "Counts",
     main = "Histogram of Car Weights")
# Make a box around the plot
box(which = "plot",
    lty = "solid",
    col = "black")

Code Histogram

# #Histogram
#bb <- na.omit(xxflights$air_time)
hh <- tibble(ee = aa$weightlbs)
ttl_hh <- "Cars: Histogram of Weight"
cap_hh <- "B16P02"
# #Basics
median_hh <- round(median(hh[[1]]), 1)
mean_hh <- round(mean(hh[[1]]), 1)
sd_hh <- round(sd(hh[[1]]), 1)
len_hh <- nrow(hh)
#
B16 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') + 
  geom_vline(aes(xintercept = mean_hh), color = '#440154FF') +
  geom_text(data = tibble(x = mean_hh, y = -Inf, 
                          label = paste0("Mean= ", mean_hh)), 
            aes(x = x, y = y, label = label), 
            color = '#440154FF', hjust = -0.5, vjust = 1.3, angle = 90) +
  geom_vline(aes(xintercept = median_hh), color = '#3B528BFF') +
  geom_text(data = tibble(x = median_hh, y = -Inf, 
                          label = paste0("Median= ", median_hh)), 
            aes(x = x, y = y, label = label), 
            color = '#3B528BFF', hjust = -0.5, vjust = -0.7, angle = 90) +
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Frequency", 
       subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh, 
                         "; Median= ", median_hh, "; SD= ", sd_hh,
                         ")"), 
        caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B16)
rm(B16)

Code Density

# #Density Curve
ttl_hh <- "Cars: Density Plot of Weight"
cap_hh <- "B16P03"
# #Get Quantiles and Ranges of mean +/- sigma 
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 3 * sd_hh})
sig3l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 3 * sd_hh})
sig2r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 2 * sd_hh}, {x < mean_hh + 3 * sd_hh})
sig2l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 2 * sd_hh}, {x > mean_hh - 3 * sd_hh})
sig1r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + sd_hh}, {x < mean_hh + 2 * sd_hh})
sig1l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - sd_hh}, {x > mean_hh - 2 * sd_hh})
sig0r_hh <- density_hh_tbl %>% filter(x > mean_hh, {x < mean_hh + 1 * sd_hh})
sig0l_hh <- density_hh_tbl %>% filter(x < mean_hh, {x > mean_hh - 1 * sd_hh})
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
# #Arrow
arr_y <- 0.0005 #mean(density_hh_tbl$y) #
arr_lst <- list(list("99.7%", xpoints_hh[1], xpoints_hh[7], arr_y),
                list("95.4%", xpoints_hh[2], xpoints_hh[6], arr_y),
                list("68.3%", xpoints_hh[3], xpoints_hh[5], arr_y))
arr_hh <- arr_lst[[1]]
#
# # Latex Labels 
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'), 
                TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'), 
                TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
B16 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_density(alpha = 0.2, colour = "#21908CFF") + 
  geom_area(data = sig3l_hh, aes(x = x, y = y), fill = '#440154FF') + 
  geom_area(data = sig3r_hh, aes(x = x, y = y), fill = '#440154FF') + 
  geom_area(data = sig2l_hh, aes(x = x, y = y), fill = '#3B528BFF') + 
  geom_area(data = sig2r_hh, aes(x = x, y = y), fill = '#3B528BFF') + 
  geom_area(data = sig1l_hh, aes(x = x, y = y), fill = '#21908CFF') + 
  geom_area(data = sig1r_hh, aes(x = x, y = y), fill = '#21908CFF') + 
  geom_area(data = sig0l_hh, aes(x = x, y = y), fill = '#5DC863FF') + 
  geom_area(data = sig0r_hh, aes(x = x, y = y), fill = '#5DC863FF') + 
  #scale_y_continuous(limits = c(0, 0.009), breaks = seq(0, 0.009, 0.003)) +
  scale_y_continuous(labels = function(n){format(n, scientific = FALSE)}) +
  scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh) + 
  annotate("segment", x = xpoints_hh[4] - 0.5 * sd_hh, xend = arr_hh[[2]], y = arr_hh[[4]], 
            yend = arr_hh[[4]], arrow = arrow(type = "closed", length = unit(0.02, "npc"))) + 
  annotate("segment", x = xpoints_hh[4] + 0.5 * sd_hh, xend = arr_hh[[3]], y = arr_hh[[4]], 
            yend = arr_hh[[4]], arrow = arrow(type = "closed", length = unit(0.02, "npc"))) + 
  annotate(geom = "text", x = xpoints_hh[4], y = arr_hh[[4]], label = arr_hh[[1]]) + 
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Density", 
       subtitle = paste0("(N=", nrow(.), "; ", "Mean= ", round(mean(.[[1]]), 1), 
                         "; Median= ", round(median(.[[1]]), 1), "; SD= ", round(sd(.[[1]]), 1),
                         ")"), 
        caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B16)
rm(B16)

8.6.2 ScatterPlot

Image

(B16P04 B16P05) Cars: Scatterplot of Weight (x) vs MPG (y) with and without the two outliers

Figure 8.3 (B16P04 B16P05) Cars: Scatterplot of Weight (x) vs MPG (y) with and without the two outliers

plot()

# Create a Scatterplot
plot(aa$weightlbs,
     aa$mpg,
     xlim = c(0, 5000),
     ylim = c(0, 600),
     xlab = "Weight",
     ylab = "MPG",
     main = "Scatterplot of MPG by Weight",
     type = "p", #Points
     pch = 16,
     col = "blue")
# Add open black circles
points(aa$weightlbs,
       aa$mpg,
       type = "p",
       col = "black")

Code ScatterPlot

# #IN: hh$x, hh$y, ttl_hh, cap_hh, x_hh, y_hh
# #Formula for Trendline calculation
k_gg_formula <- y ~ x
#
B16 <- hh %>% { ggplot(data = ., aes(x = x, y = y)) + 
    geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
    stat_poly_eq(aes(label = paste0("atop(", ..eq.label.., ", \n", ..rr.label.., ")")), 
                 formula = k_gg_formula, eq.with.lhs = "italic(hat(y))~`=`~",
                 eq.x.rhs = "~italic(x)", parse = TRUE) +
    geom_vline(aes(xintercept = round(mean(x), 3)), color = '#440154FF', linetype = "dashed") +
    geom_hline(aes(yintercept = round(mean(y), 3)), color = '#440154FF', linetype = "dashed") +
    geom_text(data = tibble(x = mean(.[["x"]]), y = -Inf, 
                            label = TeX(r'($\bar{x}$)', output = "character")), 
              aes(x = x, y = y, label = label), 
              size = 4, color = '#440154FF', hjust = 1.5, vjust = -1, parse = TRUE ) +
    geom_text(data = tibble(x = 0, y = mean(.[["y"]]), 
                            label = TeX(r'($\bar{y}$)', output = "character")), 
              aes(x = x, y = y, label = label), 
              size = 4, color = '#440154FF', hjust = 1.5, vjust = 1.5, parse = TRUE ) +
    geom_point() +
    k_gglayer_scatter +
    labs(x = x_hh, y = y_hh,
        #subtitle = TeX(r"(Trendline Equation, $R^{2}$, $\bar{x}$ and $\bar{y}$)"), 
        caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B16)
rm(B16)

8.6.3 BoxPlot

  • Interquartile (IQR) based approach for identification of Outliers
    • Refer Percentiles for Percentile, Quartile and IQR
    • IQR = Q3 - Q1
    • Any data point not in [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] is an outlier
  • Question: Why does the y-axis sometimes show a range of ~500 and sometimes ~50
    • If the 1 extreme outlier in the mpg column is kept, the axis extends up to ~500
    • If that point is deleted from the data, the axis values stay in the range of ~50
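The fence rule above is exactly what base R's boxplot.stats() uses to flag points beyond the whiskers; a minimal sketch with made-up numbers (note it computes hinges via fivenum(), which can differ slightly from quantile()):

```r
x <- c(1:10, 100)             # one obvious outlier
st <- boxplot.stats(x)
st$out                        # points beyond the 1.5 * IQR fences -> 100
st$stats                      # whisker end, lower hinge, median, upper hinge, whisker end
```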

Image

(B16P06) Cars: BoxPlot of MPG (excluding 1 point) vs. Cylinders (4, 6, 8)

Figure 8.4 (B16P06) Cars: BoxPlot of MPG (excluding 1 point) vs. Cylinders (4, 6, 8)

boxplot()

boxplot(mpg ~ cyl, data = aa, xlab = "Number of Cylinders",
        ylab = "Miles Per Gallon", main = "Mileage Data")

Code BoxPlot

# #BoxPlot
hh <- aa %>% select(mpg, cylinders) %>% filter(!cylinders %in% c(3, 5)) %>% 
  filter(mpg < max(mpg)) %>% mutate(across(cylinders, factor)) 
#
ttl_hh <- "BoxPlot of MPG (excluding 1 point) vs. Cylinders (4, 6, 8)"
cap_hh <- "B16P06"
x_hh <- "Cylinders"
y_hh <- "MPG"
#
B16 <- hh %>% { ggplot(data = ., mapping = aes(x = cylinders, y = mpg, fill = cylinders)) +
    geom_boxplot(outlier.shape = NA) +
    geom_point(position = position_jitterdodge(jitter.width = 0.1), size = 1, alpha = 0.7) + 
    k_gglayer_box +
    theme(legend.position = 'none') +
    labs(x = x_hh, y = y_hh, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B16)
rm(B16)

8.7 Numerical Methods for detecting Outliers

8.7.1 IQR Based

  • NOTE:
    • In ‘weightlbs’ there is no point outside the IQR fences, thus no data point was eliminated
    • So the Cylinder 6 subset was specifically selected for outlier detection and removal

All Values (No Outlier)

bb <- aa 
dim(bb)
## [1] 263   9
#
# #summary() or quantile()
summary(bb$weightlbs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   192.5  2245.5  2830.0  2992.9  3654.5  4997.0
q_bb <- quantile(bb$weightlbs, probs = c(.25, .75), na.rm = TRUE)
q_bb
##    25%    75% 
## 2245.5 3654.5
#
iqr_bb <- IQR(bb$weightlbs)
iqr_bb
## [1] 1409
#
upp_bb <- q_bb[2] + 1.5 * iqr_bb 
low_bb <- q_bb[1] - 1.5 * iqr_bb 
#
kept_bb <- bb[bb$weightlbs >= low_bb & bb$weightlbs <= upp_bb, ]
if(nrow(bb) == nrow(kept_bb)) {
  cat(paste0("No Point was removed because none was outside the range.\n"))
  } else cat(paste0("Number of Points removed = ", nrow(bb) - nrow(kept_bb), "\n"))
## No Point was removed because none was outside the range.

Cylinder = 6 (1 Outlier)

bb <- aa %>% filter(cylinders == 6)
dim(bb)
## [1] 57  9
#
# #summary() or quantile()
summary(bb$weightlbs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   192.5  2910.0  3121.0  3120.4  3415.0  3907.0
q_bb <- quantile(bb$weightlbs, probs = c(.25, .75), na.rm = TRUE)
q_bb
##  25%  75% 
## 2910 3415
#
iqr_bb <- IQR(bb$weightlbs)
iqr_bb
## [1] 505
#
upp_bb <- q_bb[2] + 1.5 * iqr_bb 
low_bb <- q_bb[1] - 1.5 * iqr_bb 
#
kept_bb <- bb[bb$weightlbs >= low_bb & bb$weightlbs <= upp_bb, ]
if(nrow(bb) == nrow(kept_bb)) {
  cat(paste0("No Point was removed because none was outside the range.\n"))
  } else cat(paste0("Number of Points removed = ", nrow(bb) - nrow(kept_bb), "\n"))
## Number of Points removed = 1

Z-score Standardisation

To be continued …

Validation


9 Data Preprocessing (B17, Oct-31)

9.1 Data

Please import the "B16-Cars2.csv"

9.2 Z-score Standardisation

25.20 The z-score, \({z_i}\), can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). It is associated with each \({x_i}\). The z-score is often called the standardized value or standard score.

  • (25.14) \(z_i = \frac{{x}_i - {\overline{x}}}{{s}}\)
    • \(z_i \notin [-3, +3] \to z_i \in \text{Outlier}\)
    • After Standardisation, we can compare values of different orders because these would be scaled into a dimensionless quantity
    • While comparing the Euclidean distance between two variables like Age and Income, standardisation allows these to be scaled to a similar range
      • Ex: If Age range is 20-30 but Income range is 10k-30k, directly using these values will ignore any impact of change in Age.
    • For some data mining algorithms, large differences in the ranges will lead to a tendency for the variable with greater range to have undue influence on the results. Therefore, these numeric variables should be normalised, in order to standardize the scale of effect each variable has on the results.
    • Scaling also benefits Neural networks, and algorithms that make use of distance measures, such as the k-nearest neighbors algorithm.
    • Note that the Mean itself includes the impact of extreme values, so it is not very robust. The IQR is better because it is based on position
    • Standardisation does not convert non-normal data to normal. It does not change the shape of the data: outliers remain outliers and skewness remains. It simply changes the scale.
  • Question: Can we use different methods for outlier identification on different variables
    • Yes, you can remove outliers of one variable using IQR and of another variable using standardisation
  • Question: From the normalised values, can we convert back to the original data
    • Tiring process
    • (Aside) The scale() function attaches the mean and sd as attributes ("scaled:center", "scaled:scale") to the output matrix. Those can be used to convert the data back.
  • scale()
    • It centres and scales a numeric vector to mean 0 and sd 1; it does not make the distribution normal
  • (Aside)
    • Scaling is less effective if outliers are present. An extremely low value (e.g. Weight 192.5) or an extremely high value (Mileage 527) will obviously distort the scale applied. It is probably better to follow: Scaling | Outlier Treatment (Identification, Removal or Imputation) | Scaling of the original data after Outlier Treatment
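The back-conversion aside above can be sketched directly from the attributes that scale() attaches (a minimal example with made-up numbers):

```r
x <- c(10, 20, 30, 40, 55)
z <- scale(x)                          # N x 1 matrix with scaling attributes
ctr <- attr(z, "scaled:center")        # mean(x)
scl <- attr(z, "scaled:scale")         # sd(x)
x_back <- as.vector(z) * scl + ctr     # undo the standardisation
stopifnot(isTRUE(all.equal(x_back, x)))
```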

Normalisation

# #Normalising Weight
bb <- aa %>% select(weightlbs) %>% mutate(z = as.vector(scale(weightlbs)))
str(bb)
## tibble [263 x 2] (S3: tbl_df/tbl/data.frame)
##  $ weightlbs: num [1:263] 4209 1925 3449 3761 2051 ...
##  $ z        : num [1:263] 1.402 -1.231 0.526 0.885 -1.086 ...
#
# #Excluding Outliers
kept_bb <- bb[bb$z >= -3 & bb$z <= 3, ]
str(kept_bb)
## tibble [262 x 2] (S3: tbl_df/tbl/data.frame)
##  $ weightlbs: num [1:262] 4209 1925 3449 3761 2051 ...
##  $ z        : num [1:262] 1.402 -1.231 0.526 0.885 -1.086 ...
#
# #Similarly with mpg
kept_bb <- aa %>% select(mpg) %>% mutate(z = as.vector(scale(mpg))) %>% filter(z >= -3 & z <= 3)
str(kept_bb)
## tibble [262 x 2] (S3: tbl_df/tbl/data.frame)
##  $ mpg: num [1:262] 14 31.9 17 15 30.5 23 13 14 25.4 37.7 ...
##  $ z  : num [1:262] -0.346 0.213 -0.252 -0.314 0.17 ...
summary(kept_bb)
##       mpg              z           
##  Min.   :10.00   Min.   :-0.47040  
##  1st Qu.:16.93   1st Qu.:-0.25421  
##  Median :22.00   Median :-0.09577  
##  Mean   :23.15   Mean   :-0.05981  
##  3rd Qu.:28.70   3rd Qu.: 0.11340  
##  Max.   :46.60   Max.   : 0.67222

scale()

# #scale(x, center = TRUE, scale = TRUE) output is Nx1 Matrix
bb <- aa %>% select(weightlbs) 
ii <- bb %>% mutate(z = as.vector(scale(weightlbs)))
#bb %>% mutate(z = across(weightlbs, scale)) #matrix
#bb %>% mutate(z = across(weightlbs, ~ as.vector(scale(.)))) #tibble
jj <- bb %>% mutate(across(weightlbs, list(z = ~ as.vector(scale(.))), .names = "{.fn}"))
kk <- bb
kk$z <- as.vector(scale(kk$weightlbs))
stopifnot(all(identical(ii, jj), identical(ii, kk)))

9.3 Min-Max Scaling

  • \(x^* = \frac{{x}_i - \text{min}(x)}{\text{range}(x)} = \frac{{x}_i - \text{min}(x)}{\text{max}(x) - \text{min}(x)} \to x^* \in [0, 1]\)
    • This is for scaling only, not for removal of outliers
# #Min-Max Scaling
min_aa <- min(aa$weightlbs)
max_aa <- max(aa$weightlbs)
bb <- aa %>% select(weightlbs) %>% mutate(z = {weightlbs - min_aa}/{max_aa - min_aa})
str(bb)
## tibble [263 x 2] (S3: tbl_df/tbl/data.frame)
##  $ weightlbs: num [1:263] 4209 1925 3449 3761 2051 ...
##  $ z        : num [1:263] 0.836 0.361 0.678 0.743 0.387 ...
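The same [0, 1] scaling is also available as scales::rescale() (the scales package is attached in this session); a quick equivalence check on made-up values:

```r
library(scales)
x <- c(192.5, 2245.5, 2830, 3654.5, 4997)
z <- rescale(x)                                    # defaults to the range [0, 1]
stopifnot(isTRUE(all.equal(z, (x - min(x)) / (max(x) - min(x)))))
range(z)                                           # 0 at the minimum, 1 at the maximum
```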

9.4 Decimal Scaling

  • \(x^* = \frac{{x}_i}{10^d} \to x^* \in [-1, 1]\)
    • Where ‘d’ represents the number of digits in the largest absolute value i.e. if max(abs(x)) is 6997, d will be 4
# #Count Digits in Maximum (NOTE: Take care of NA, 0, [-1, 1] values)
d_bb <- 10^{floor(log10(max(abs(aa$weightlbs)))) + 1}
# #Decimal Scaling
bb <- aa %>% select(weightlbs) %>% mutate(z = weightlbs/d_bb)
str(bb)
## tibble [263 x 2] (S3: tbl_df/tbl/data.frame)
##  $ weightlbs: num [1:263] 4209 1925 3449 3761 2051 ...
##  $ z        : num [1:263] 0.421 0.192 0.345 0.376 0.205 ...
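The caveat in the comment above about negative values can be checked on a small made-up vector: taking ‘d’ from max(abs(x)) keeps every scaled value inside [-1, 1] even with negatives:

```r
x <- c(-6997, 42, 512)
d <- 10^(floor(log10(max(abs(x)))) + 1)   # 10^4: the largest magnitude has 4 digits
z <- x / d
z                                          # -0.6997  0.0042  0.0512
stopifnot(all(abs(z) <= 1))
```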

9.5 Comparison

Histogram

(B16P02 B17P01) Cars: Histogram of Weight (Original vs Scaled)

Figure 9.1 (B16P02 B17P01) Cars: Histogram of Weight (Original vs Scaled)

hist()

par(mfrow = c(1,2))
# Create two histograms
hist(bb$weightlbs, breaks = 20,
     xlim = c(1000, 5000),
     main = "Histogram of Weight",
     xlab = "Weight",
     ylab = "Counts")
box(which = "plot",
    lty = "solid",
    col = "black")
#
hist(bb$z,
     breaks = 20,
     xlim = c(-2, 3),
     main = "Histogram of Zscore
of Weight",
     xlab = "Z-score of Weight",
     ylab = "Counts")
box(which = "plot",
    lty = "solid",
    col = "black")

9.6 Skewness

  • Refer Skewness
  • Scaling does not change the skewness
# #Skewness
bb <- aa %>% select(weightlbs) %>% mutate(z = as.vector(scale(weightlbs)))
ii <- bb$weightlbs
#
3 * {mean(ii) - median(ii)} / sd(ii)
## [1] 0.5632797
#
ii <- bb$z
3 * {mean(ii) - median(ii)} / sd(ii)
## [1] 0.5632797

9.7 Non-linear Transformations

  • It is done for conversion of non-normal data to normal
    • Note that scaling is a linear transformation
  • Transformations
    • Square Root - sqrt()
    • Log - log(), log10()
    • Inverse Square Root
bb <- aa %>% select(weightlbs) %>% 
  mutate(z = as.vector(scale(weightlbs)), Sqrt = sqrt(weightlbs),
         Log = log(weightlbs), InvSqr = 1/Sqrt)
#
# #Check Skewness
vapply(bb, function(x) round(3 * {mean(x) - median(x)} / sd(x), 3), numeric(1))
## weightlbs         z      Sqrt       Log    InvSqr 
##     0.563     0.563     0.339     0.086     0.145

Histogram

(B17P03) Cars: Weight Transformed with Original & Scaled

Figure 9.2 (B17P03) Cars: Weight Transformed with Original & Scaled

facet_wrap()

# #Histogram
bb <- aa %>% select(weightlbs) %>% 
  mutate(z = as.vector(scale(weightlbs)), Sqrt = sqrt(weightlbs), 
         Log = log(weightlbs), InvSqr = 1/Sqrt) %>% 
  pivot_longer(everything(), names_to = "Key", values_to = "Values") %>% 
  mutate(across(Key, factor, levels = c("Sqrt", "Log", "InvSqr", "weightlbs", "z"), 
    labels = c("Square Root", "Natural Log", "Inverse Square", 
               "Original Weight", "Scaled Weight")))
#
hh <- bb
mean_hh <- hh %>% group_by(Key) %>% summarize(Mean = mean(Values))
#
ttl_hh <- "Cars: Weight with Transformed values and Mean"
cap_hh <- "B17P03"
#
B17 <- hh %>% { ggplot(data = ., mapping = aes(Values)) + 
    geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') + 
    geom_vline(data = mean_hh, aes(xintercept = Mean), color = '#440154FF') +
      geom_text(data = mean_hh, aes(x = Mean, y = -Inf, label = paste0("Mean= ", f_pNum(Mean))), 
            color = '#440154FF', hjust = -0.5, vjust = 1.3, angle = 90) +
    facet_wrap(~Key, scales = 'free_x') +
    theme(plot.title.position = "panel") + 
    labs(x = "x", y = "Frequency", caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B17)
rm(B17)

f_pNum()

f_pNum <- function(x, digits = 2L) {
  # #Print Numbers
  # #round(), rounds to a number of decimal places
  # #signif() rounds to a specific number of significant places
  if(FALSE){#Test Case: Why this function is needed
     round(0.001, 2) #0
    f_pNum(0.001, 2) #0.001
    # #Problem is signif converts to Scientific and there is no way to disable it
    signif(0.000000198, 2) #2e-07
    # #Same problem with round
    round(0.000000198, 7) #2e-07
  }
  # #if(){} else if(){} else{} is NOT vectorised
  #ifelse(abs(x) < 0.0000001, 0*sign(x), ifelse(abs(x) > 1, round(x, digits), signif(x, digits + 1L)))
  ifelse(abs(x) < 0.0000001, 0*sign(x), floor(x) + signif(x %% 1, digits))
}

hist()

# #Histogram with Density Overlay
par(mfrow=c(1,1))
hist(bb$InvSqr,
     breaks = 30,
     xlim=c(0.0125, 0.0275),
     col = "lightblue",
     prob = TRUE,
     border = "black",
     xlab="Inverse Square Root of Weight",
     ylab = "Counts",
     main = "Histogram of Inverse Square Root of Weight")
box(which = "plot",
    lty = "solid",
    col="black")
# #Overlay kernel density estimate
lines(density(bb$InvSqr), col="red")

9.8 QQ Plot

A QQ (quantile-quantile) plot is a probability plot for comparing two probability distributions by plotting their quantiles against each other. A point \((x, y)\) on the plot corresponds to one of the quantiles of the second distribution (y-coordinate) plotted against the same quantile of the first distribution (x-coordinate). If the two distributions being compared are similar, the points in the QQ plot will approximately lie on the line \(y = x\).
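The "points lie near the line" claim can be sanity-checked numerically with simulated Normal data (a sketch, independent of the Cars data):

```r
set.seed(3)
x <- rnorm(200)                                # simulated Normal sample
p <- ppoints(200)                              # plotting positions in (0, 1)
theo <- qnorm(p)                               # theoretical Normal quantiles
samp <- quantile(x, probs = p, names = FALSE)  # matching sample quantiles
cor(theo, samp)                                # close to 1 for Normal data
```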

  • Refer figure 4.6
    • The QQ Plot shows whether the sample quantiles stay close to those expected under a Normal distribution

Image

(B17P04) Cars: QQ Plots of Transformed Weight

Figure 9.3 (B17P04) Cars: QQ Plots of Transformed Weight

Flipped Axis

(B17P05) Cars: QQ Plots of Transformed Weight

Figure 9.4 (B17P05) Cars: QQ Plots of Transformed Weight

Code

# #QQ Plot
bb <- aa %>% select(weightlbs) %>% 
  filter(weightlbs > min(weightlbs)) %>% 
  mutate(z = as.vector(scale(weightlbs)), Sqrt = sqrt(weightlbs), 
         Log = log(weightlbs), InvSqr = 1/Sqrt) %>% 
  pivot_longer(everything(), names_to = "Key", values_to = "Values") %>% 
  mutate(across(Key, factor, levels = c("Sqrt", "Log", "InvSqr", "weightlbs", "z"), 
    labels = c("Square Root", "Natural Log", "Inverse Square", 
               "Original Weight", "Scaled Weight")))
#
hh <- bb
#hh %>% group_by(Key) %>% summarize(Max = max(Values), Min = min(Values))
max_hh <- min_hh <- hh %>% group_by(Key) %>% summarise(Values = 0)
#
# #Modify Number of Y-Axis Major Gridlines for Horizontal Comparison
max_hh$Values  <- c(100, 8.55, 0.0300, 5000, 2.35) #c(72, 8.55, 0.0255, 5000, 2.35)
min_hh$Values  <- c(20, 7.35, 0.0135, 1500, -1.65) #c(40, 7.35, 0.0135, 1500, -1.65)
#
ttl_hh <- "QQ Plots of Transformed Weight"
sub_hh <- "Excluded 1 Outlier and Modified Y-axis for alignment"
cap_hh <- "B17P04"
#
B17 <- hh %>% { ggplot(., aes(sample = Values)) +
    stat_qq() +
    stat_qq_line() +
    geom_blank(data=max_hh, aes(y = Values)) +
    geom_blank(data=min_hh, aes(y = Values)) +
    facet_wrap(~Key, scales = 'free') +
    scale_x_continuous(limits = c(-3, 3)) + 
    #coord_flip() +
    labs(caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, B17)
rm(B17)

qqnorm()

# Normal Q-Q Plot
qqnorm(bb$InvSqr,
       datax = TRUE,
       col = "red",
       ylim = c(0.01, 0.03),
       main = "Normal
Q-Q Plot of Inverse Square Root of Weight")
qqline(bb$InvSqr,
       col = "blue",
       datax = TRUE)

9.9 Shapiro

32.5 The Shapiro-Wilk test is a test of normality. It tests the null hypothesis that a sample came from a normally distributed population. \(P_{\text{shapiro}} > ({\alpha} = 0.05) \to \text{Data is Normal}\). Avoid samples with more than 5000 observations (shapiro.test() accepts 3 to 5000 points).

9.9.1 rnorm()

  • rnorm(n, mean = 0, sd = 1)
    • Create Vector of Random Numbers with given mean and sd
set.seed(3)
ii <- rnorm(n = 100, mean = 50, sd = 5.99)
#
# #Check Normality of randomly generated Normal dataset
shapiro.test(ii)
## 
##  Shapiro-Wilk normality test
## 
## data:  ii
## W = 0.97928, p-value = 0.1167
#
# #Check Normality of Weight
ii <- aa %>% select(weightlbs) %>% 
  #filter(weightlbs > min(weightlbs)) %>%
  mutate(z = as.vector(scale(weightlbs)), Sqrt = sqrt(weightlbs), 
         Log = log(weightlbs), InvSqr = 1/Sqrt) %>% 
  pivot_longer(everything(), names_to = "Key", values_to = "Values") %>% 
  mutate(across(Key, factor, levels = c("Sqrt", "Log", "InvSqr", "weightlbs", "z"), 
    labels = c("Square Root", "Natural Log", "Inverse Square", 
               "Original Weight", "Scaled Weight")))
#
# #No Transformation was able to convert the data to Normality 
# #Even after excluding 1 outlier (Not shown here)
ii %>% group_by(Key) %>% 
  summarise(p_Shapiro = shapiro.test(Values)$p.value, 
            isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 5 x 3
##   Key             p_Shapiro isNormal
##   <fct>               <dbl> <lgl>   
## 1 Square Root      2.14e- 7 FALSE   
## 2 Natural Log      1.45e-14 FALSE   
## 3 Inverse Square   4.03e-25 FALSE   
## 4 Original Weight  6.81e- 7 FALSE   
## 5 Scaled Weight    6.81e- 7 FALSE

9.9.2 cut()

# #Continuous to Categorical (Bins)
cut_ii <- cut(aa$weightlbs, breaks = 3, dig.lab = 4, include.lowest = TRUE, ordered_result = TRUE)
levels(cut_ii)
## [1] "[187.7,1794]" "(1794,3396]"  "(3396,5002]"
#
# #ggplot2::cut_interval()
cut_jj <- cut_interval(aa$weightlbs, n = 3, dig.lab = 4, ordered_result = TRUE)
levels(cut_jj)
## [1] "[192.5,1794]" "(1794,3396]"  "(3396,4997]"
#
# #With Labels: NOTE default ordering is ascending
levels(cut(aa$weightlbs, breaks = 3, dig.lab = 4, include.lowest = TRUE, ordered_result = TRUE, 
           labels = c("low", "medium", "high")))
## [1] "low"    "medium" "high"
levels(cut_interval(aa$weightlbs, n = 3, dig.lab = 4, ordered_result = TRUE, 
                    labels = c("low", "medium", "high")))
## [1] "low"    "medium" "high"

9.10 Continuous to Categorical Groups

bb <- aa %>% select(weightlbs) %>% rename(Weight = 1)
#
# #Subsetting
# #Create Column explicitly to prevent Warning message: Unknown or uninitialised column: `ii`. 
bb$ii <- NA
bb$ii[bb$Weight >= 3000] <- 1
bb$ii[bb$Weight < 3000] <- 2
#
# #Using ifelse() or case_when()
bb <- bb %>% mutate(jj = ifelse(Weight >= 3000, 1, 2), 
                    kk = case_when(Weight >= 3000 ~ 1, Weight < 3000 ~ 2))
stopifnot(all(identical(bb$ii, bb$jj), identical(bb$ii, bb$kk)))
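The same binary grouping can also be produced with cut() from the previous section; a sketch on made-up weights (right = FALSE puts 3000 in the upper interval, matching Weight >= 3000):

```r
w <- c(1500, 2999, 3000, 4500)
g <- cut(w, breaks = c(-Inf, 3000, Inf), labels = c(2, 1), right = FALSE)
as.integer(as.character(g))     # 2 2 1 1, same coding as ii/jj/kk above
```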

9.11 Index

# #Create Data
set.seed(3)
bb <- tibble(x = rnorm(n = 10, mean = 5, sd = 0.55), 
             y = rnorm(n = 10, mean = 4.5, sd = 0.66))
#
# #Basic Indexing
bb$i <- 1:nrow(bb)
# #Index can start from anywhere. However, it is not recommended.
bb$j <- 5:{nrow(bb) + 5L - 1L}
# 
# #Other Methods
bb$k <- seq_along(bb[[1]])
bb$l <- seq_len(nrow(bb))
bb$m <- seq.int(nrow(bb))
# #Note the placement of column at the beginning i.e. column index modified
bb <- cbind(n = 1:nrow(bb), bb)
bb <- rowid_to_column(bb, "o")
#
bb <- bb %>% mutate(p = row_number())
#
# #Excluding 'j' all other columns are equal. However, 'n' & 'o' modify column index
stopifnot(all(identical(bb$i, bb$k), identical(bb$i, bb$l), identical(bb$i, bb$m), 
  identical(bb$i, bb$n), identical(bb$i, bb$o), identical(bb$i, bb$p)))

Validation


10 Unsupervised Learning (B18, Nov-07)

10.1 Overview

  • “Unsupervised Learning Algorithms”
    • “ForLater”
    • Case Analysis of JAT is Merged in notes of Case Study: JAT
    • NOTE: Discussion about Jupyter Notebook & Anaconda Navigator “15:45 - 16:05” is NOT covered because I am not working with it.
    • NOTE: Package ‘esquisse’ was not used because interactive use is difficult to show in document format. “16:15 - 16:30”

10.2 Install

if(FALSE){# #WARNING: Installation may take some time.
  install.packages("esquisse", dependencies = TRUE)
}

10.3 Data: Churn

Please import the "B18-Churn.xlsx"

10.4 Q1

  • Explore whether there are missing values for any of the variables.
    • There are NO missing values

Transposed Churn

Table 10.1: (B18T02) Churn Transposed
Col_Row Row_1 Row_2 Row_3 Row_4 Row_5 Row_6
State KS OH NJ OH OK AL
Account Length 128 107 137 84 75 118
Area Code 415 415 415 408 415 510
Phone 382-4657 371-7191 358-1921 375-9999 330-6626 391-8027
Int’l Plan no no no yes yes yes
VMail Plan yes yes no no no no
VMail Message 25 26 0 0 0 0
Day Mins 265.1 161.6 243.4 299.4 166.7 223.4
Day Calls 110 123 114 71 113 98
Day Charge 45.07 27.47 41.38 50.9 28.34 37.98
Eve Mins 197.4 195.5 121.2 61.9 148.3 220.6
Eve Calls 99 103 110 88 122 101
Eve Charge 16.78 16.62 10.3 5.26 12.61 18.75
Night Mins 244.7 254.4 162.6 196.9 186.9 203.9
Night Calls 91 103 104 89 121 118
Night Charge 11.01 11.45 7.32 8.86 8.41 9.18
Intl Mins 10 13.7 12.2 6.6 10.1 6.3
Intl Calls 3 3 5 7 3 6
Intl Charge 2.7 3.7 3.29 1.78 2.73 1.7
CustServ Calls 1 1 0 2 3 0
Churn False. False. False. False. False. False.

Churn Data

Table 10.2: (B18T01) Churn
State Account Length Area Code Phone Int’l Plan VMail Plan VMail Message Day Mins Day Calls Day Charge Eve Mins Eve Calls Eve Charge Night Mins Night Calls Night Charge Intl Mins Intl Calls Intl Charge CustServ Calls Churn
KS 128 415 382-4657 no yes 25 265.1 110 45.07 197.4 99 16.78 244.7 91 11.01 10.0 3 2.70 1 False.
OH 107 415 371-7191 no yes 26 161.6 123 27.47 195.5 103 16.62 254.4 103 11.45 13.7 3 3.70 1 False.
NJ 137 415 358-1921 no no 0 243.4 114 41.38 121.2 110 10.30 162.6 104 7.32 12.2 5 3.29 0 False.
OH 84 408 375-9999 yes no 0 299.4 71 50.90 61.9 88 5.26 196.9 89 8.86 6.6 7 1.78 2 False.
OK 75 415 330-6626 yes no 0 166.7 113 28.34 148.3 122 12.61 186.9 121 8.41 10.1 3 2.73 3 False.
AL 118 510 391-8027 yes no 0 223.4 98 37.98 220.6 101 18.75 203.9 118 9.18 6.3 6 1.70 0 False.

NA

anyNA(bb)
## [1] FALSE

10.5 Q2

  • Compare the area code and state fields. Discuss any apparent abnormalities.
    • There are only 3 Area Codes (408, 415, 510) and all of them belong to the State of California

States Bar Chart

(B18P01) Churn: States Frequency

Figure 10.1 (B18P01) Churn: States Frequency

Exploration

# #Select | Rename 
bb <- aa %>% select(`Area Code`, State) %>% rename(Area = "Area Code") 
# #Select | Group | Frequency | Descending   
ii <- bb %>% select(State) %>% group_by(State) %>% summarise(CNT = n()) %>% arrange(desc(CNT)) %>% 
     mutate(across(State, factor, levels = rev(unique(State)), ordered = TRUE))

State & Area

ii <- bb %>% mutate(across(everything(), factor))
#
# #Unique Values
ii %>% summarise(across(everything(), ~ length(unique(.))))
## # A tibble: 1 x 2
##    Area State
##   <int> <int>
## 1     3    51
#
summary(ii)
##   Area          State     
##  408: 838   WV     : 106  
##  415:1655   MN     :  84  
##  510: 840   NY     :  83  
##             AL     :  80  
##             OH     :  78  
##             OR     :  78  
##             (Other):2824
#
str(levels(ii$Area))
##  chr [1:3] "408" "415" "510"
str(levels(ii$State))
##  chr [1:51] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN" ...

Code Bar Chart

# #Proper Sorting of Factors for Flipped Axes
hh <- ii %>% mutate(nState = as.integer(State))
# #Because CNT has duplicated values, ggplot would add them up if used on the x-axis
anyDuplicated(ii$CNT)
# #So, place it on Y-axis and then flip the axis
#
# #Set Alternate Labels as blanks on both Primary and Secondary x-axis
#x_sec <- x_prim <- as.character(hh$State)
#x_prim[1:nrow(hh) %%2 != 1] <- ""
#x_sec[1:nrow(hh) %%2 == 1] <- ""
#
# #Get Median Location
#hh %>% filter(CNT == median(CNT)) %>% mutate(as.integer(State))
median_loc_hh <- ceiling(nrow(hh)/2) 
#
cap_hh <- "B18P01"
ttl_hh <- "Churn: Frequency of States"
sub_hh <- paste0(nrow(hh), " States with Median = ",  median(hh$CNT)) #NULL
#
B18 <- hh %>% { ggplot(data = ., aes(x = nState, y = CNT)) +
    geom_bar(stat = 'identity', aes(fill = (nState %% 2 == 0))) + 
    geom_vline(aes(xintercept = median_loc_hh), color = '#440154FF') +
    scale_x_continuous( #sec.axis = sec_axis(~., breaks = 1:nrow(.), labels = rev(.$State)), 
      breaks = 1:nrow(.), guide = guide_axis(n.dodge = 2), labels = rev(.$State)) + 
    k_gglayer_bar + 
    coord_flip() +
    labs(x = "State", y = "Frequency", subtitle = sub_hh, 
         caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)

10.6 Q3

  • Use useful graphs to visually determine whether there are any outliers in the datasets (Note: check the same for all the numeric variables).
    • Histograms
    • QQ Plots
    • Box Plots

10.6.1 All Histograms

Image

(B18P03) Churn: All Histograms

Figure 10.2 (B18P03) Churn: All Histograms

Code

ii <- bb %>% 
  select(where(is.numeric)) %>% 
  select(!area_code) %>% 
  relocate(ends_with("_mins")) %>% 
  relocate(ends_with("_calls")) %>% 
  relocate(vmail_message, .after =  last_col()) %>% 
  pivot_longer(everything(), names_to = "Key", values_to = "Values") %>% 
  mutate(across(Key, ~ factor(., levels = unique(Key))))
#
str(ii)
# #Histogram
hh <- ii
ttl_hh <- "Churn: Histograms"
cap_hh <- "B18P03"
#
B18 <- hh %>% { ggplot(data = ., mapping = aes(x = Values)) + 
    geom_histogram(bins = ifelse(length(unique(.[[1]])) > 50, 50, length(unique(.[[1]]))),
                   alpha = 0.4, fill = '#FDE725FF') + 
    theme(plot.title.position = "panel", 
          strip.text.x = element_text(size = 10, colour = "white")) +
    facet_wrap(~Key, nrow = 3, scales = 'free') +
    labs(x = "x", y = NULL, caption = cap_hh, subtitle = NULL, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)

10.6.2 All QQ Plots

Image

(B18P04) Churn: All QQ Plots

Figure 10.3 (B18P04) Churn: All QQ Plots

Code

# #QQ Plots
hh <- ii
ttl_hh <- "Churn: QQ Plots"
cap_hh <- "B18P04"
#
B18 <- hh %>% { ggplot(., aes(sample = Values)) +
    stat_qq() +
    stat_qq_line() +
    facet_wrap(~Key, nrow = 3, scales = 'free_y') +
    #scale_x_continuous(limits = c(-3, 3)) + 
    #coord_flip() +
    theme(plot.title.position = "panel", 
          strip.text.x = element_text(size = 10, colour = "white")) +
    labs(x = "x", y = NULL, caption = cap_hh, subtitle = NULL, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)

10.6.3 9 Box Plots

Image

(B18P05 B18P06 B18P07) Churn: BoxPlots of Calls, Minutes, & Charges

Figure 10.4 (B18P05 B18P06 B18P07) Churn: BoxPlots of Calls, Minutes, & Charges

Code

# #BoxPlot
B18 <- hh %>% { ggplot(data = ., mapping = aes(x = Key, y = Values, fill = Key)) +
    geom_boxplot() +
    k_gglayer_box +
    theme(legend.position = 'none') +
    labs(x = NULL, y = NULL, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)

10.6.4 International Calls Box Plots

Image

(B18P08 B18P09 B18P10) Churn: BoxPlots of International Calls, Minutes, & Charges

Figure 10.5 (B18P08 B18P09 B18P10) Churn: BoxPlots of International Calls, Minutes, & Charges

Code

# #BoxPlot
B18 <- hh %>% { ggplot(data = ., mapping = aes(y = Values)) +
    geom_boxplot() +
    k_gglayer_box +
    theme(legend.position = 'none') +
    labs(x = NULL, y = NULL, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)

10.6.5 3 Box Plots

(B18P11 B18P12 B18P13) Churn: BoxPlots of Remaining Three

Figure 10.6 (B18P11 B18P12 B18P13) Churn: BoxPlots of Remaining Three

10.6.6 International Calls

Image

(B18P02) Churn: International Calls

Figure 10.7 (B18P02) Churn: International Calls

Exploration

# #Rename to Proper Names | To Lower, Replace by Underscore | Coercion 
bb <- aa %>% rename_with(make.names) %>% 
  rename_with(~ tolower(gsub(".", "_", .x, fixed = TRUE))) %>% 
  mutate(across(c(int_l_plan, vmail_plan), ~case_when(. == "yes" ~ TRUE, . == "no" ~ FALSE))) %>% 
  mutate(across(churn, ~case_when(. == "True." ~ TRUE, . == "False." ~ FALSE))) %>% 
  mutate(across(ends_with("_calls"), as.integer))
#t(bb %>% summarise(across(everything(), ~length(unique(.)))))
#str(bb)
#summary(bb)

Code

# #Histogram
hh <- tibble(ee = bb$intl_calls)
ttl_hh <- "Churn: Histogram of International Calls"
cap_hh <- "B18P02"
# #Bins
summary(hh[[1]])
bins_hh <- ifelse(length(unique(hh[[1]])) > 50, 50, length(unique(hh[[1]])))
# #Basics
median_hh <- round(median(hh[[1]]), 1)
mean_hh <- round(mean(hh[[1]]), 1)
sd_hh <- round(sd(hh[[1]]), 1)
len_hh <- nrow(hh)
#
B18 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_histogram(bins = bins_hh, alpha = 0.4, fill = '#FDE725FF') + 
  geom_vline(aes(xintercept = mean_hh), color = '#440154FF') +
  geom_text(data = tibble(x = mean_hh, y = -Inf, 
                          label = paste0("Mean= ", mean_hh)), 
            aes(x = x, y = y, label = label), 
            color = '#440154FF', hjust = -0.5, vjust = 1.3, angle = 90) +
  geom_vline(aes(xintercept = median_hh), color = '#3B528BFF') +
  geom_text(data = tibble(x = median_hh, y = -Inf, 
                          label = paste0("Median= ", median_hh)), 
            aes(x = x, y = y, label = label), 
            color = '#3B528BFF', hjust = -0.5, vjust = -0.7, angle = 90) +
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Frequency", 
       subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh, 
                         "; Median= ", median_hh, "; SD= ", sd_hh,
                         ")"), 
        caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)

10.6.7 Scale

Histograms

(B18P14) Churn: All Histograms (Scaled)

Figure 10.8 (B18P14) Churn: All Histograms (Scaled)

QQ Plots

(B18P15) Churn: All QQ Plots (Scaled)

Figure 10.9 (B18P15) Churn: All QQ Plots (Scaled)

BoxPlots

(B18P16) Churn: All Box Plots (Scaled)

Figure 10.10 (B18P16) Churn: All Box Plots (Scaled)

10.6.8 Overlaid Histograms

(B18P17) Churn: All Histograms

Figure 10.11 (B18P17) Churn: All Histograms

10.7 Q4

  • Identify the outliers, using:
    • The Z-score method \(z \notin [-3, +3] \to \text{Outlier}\)
    • The IQR method
  • Shown Above
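Both rules referred to above can be wrapped into one small helper for any numeric column (a sketch run on simulated data, not on the Churn variables):

```r
# Count outliers under the Z-score (|z| > 3) and IQR (1.5 * IQR fence) rules
outlier_counts <- function(x) {
  z <- (x - mean(x)) / sd(x)
  q <- quantile(x, probs = c(.25, .75), names = FALSE)
  iqr <- q[2] - q[1]
  c(z_rule   = sum(abs(z) > 3),
    iqr_rule = sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr))
}
set.seed(3)
outlier_counts(c(rnorm(99), 10))   # one planted extreme point
```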

Validation


11 Supervised Learning (B19, Nov-14)

11.1 Overview

  • “Supervised Learning Algorithm: Cluster Analysis”

45.1 In unsupervised methods, no target variable is identified as such. Instead, the data mining algorithm searches for patterns and structures among all the variables. The most common unsupervised data mining method is clustering. Ex: Voter Profile.

45.2 Supervised methods are those in which there is a particular prespecified target variable and the algorithm is given many examples where the value of the target variable is provided. This allows the algorithm to learn which values of the target variable are associated with which values of the predictor variables.

  • Data Mining Methods and Definitions
    • Data mining methods may be categorized as either supervised or unsupervised.
    • Most data mining methods are supervised methods.
    • Unsupervised : Clustering, PCA, Factor Analysis, Association Rules, RFM
    • Supervised :
      • Regression (Continuous Target) : Linear Regression, Regularised Regression, Decision trees, Ensemble learning
        • Linear Regression : Ridge, Lasso and Elastic Regression
        • Ensemble learning : Bagging, Boosting (AdaBoost, XGBoost), Random forests
      • Classification (Categorical Target) : Decision trees, Ensemble learning, Logistic Regression, k-nearest neighbor (k-NN), Naive-Bayes
      • Deep Learning : Neural Networks

11.2 RFM Analysis

  • RFM - Recency, Frequency, and Monetary value
    • It is for customer segmentation
    • Recency -
      • Freshness of customer activity (purchase /visit)
    • Frequency -
      • Total number of transactions in a given period
    • Monetary value -
      • Total or Average Transaction value
    • An RFM score is calculated, each parameter is assigned a weightage, and based on that all the customers are classified
      • R /F /M score - customers are ranked and a score in [1, 5] is allocated on each dimension
      • Then we can define rules like
        • 125 possible rules (5 × 5 × 5 score combinations)
        • e.g. anyone having R in [3, 5], F in [4, 5], M in [4, 5] is a very important customer for us
  • (Aside) Caution:
    • RFM ignores other clusters e.g. gender
    • It ignores seasonal /cyclical trends
    • It does not look at the duration of customer engagement i.e. A has done 10 transactions in 10 Years, B has done 9 transactions in 10 Months.
  • Question: Can the scoring be different from [1, 5] in R
    • Not in R
    • (Aside) Defaults can be modified
  • Question: Can we have fewer than 5 ratings in R e.g. [1, 4]
    • Not in R
    • (Aside) Defaults can be modified
  • Question: What if there are outliers? If the maximum recency is 100 while all the other values are 1-40, would we still assign the same rank
    • Yes
    • This is percentile-based segregation, which is least affected by outliers
  • Question: Can we delete that outlier
    • No, we cannot
    • If all customers are purchasing only up to 5000 and one customer is spending 10000, can we afford to overlook that customer?
  • Question: Is the score affected by business context? i.e. a recency value of 100 might be an outlier for one business but not for another
    • No
    • Business context affects the weights assigned to them. However, that is part of later analysis.
  • Question: For Recency, will there be a maximum limit that we can consider
    • No
    • Most companies do this analysis every quarter and that can be used as a benchmark for next quarter
    • Further, a consumer durable company would not look at it every quarter. They would be looking at a broader horizon.
  • Question: Do we look at the item-wise details
    • No, that is the limitation of RFM
    • We are looking at the Store level (total amount) not the item-wise details. For that we need the ‘Market Basket’ analysis

11.3 Packages

if(FALSE){# #WARNING: Installation may take some time.
  install.packages("rfm", dependencies = TRUE)
  install.packages("lubridate", dependencies = TRUE)
}

11.4 Data RFM

Please import the "B19-Transaction.csv"

11.5 Transaction vs Customer Level Data

  • In transaction-level data, a customer id can be repeated; each row represents a single transaction
  • In customer-level data, each row represents a unique customer with no duplicates; it is a summarised view of all transactions
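The collapse from transaction level to customer level can be sketched in base R on a hypothetical toy table (the rfm package does this internally; the tidyverse equivalent would be group_by() + summarise()):

```r
# #Hypothetical transaction-level data: customer_id repeats across rows
tx <- data.frame(
  customer_id = c("A", "A", "B", "C", "C", "C"),
  order_date  = as.Date(c("2006-01-10", "2006-06-01", "2006-03-15",
                          "2005-12-01", "2006-02-20", "2006-11-30")),
  revenue     = c(50, 70, 30, 100, 40, 60)
)
analysis_date <- max(tx$order_date) + 1
#
# #Customer-level data: one row per unique customer summarising all its transactions
cust <- do.call(rbind, lapply(split(tx, tx$customer_id), function(d) {
  data.frame(customer_id       = d$customer_id[1],
             recency_days      = as.numeric(analysis_date - max(d$order_date)),
             transaction_count = nrow(d),
             amount            = sum(d$revenue))
}))
cust #3 rows: A, B, C
```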

11.6 RFM on Transaction

  • rfm_table_order()
    • Transforms transaction-level data into customer-level data
    • If the scores are Recency = 3, Frequency = 4, Monetary = 3, then it reports the RFM Score as ‘343’
    • We can supply bins for each of R, F, and M. See the function example. The bins used are available as attributes of the result for reference

Run RFM

# #character to date using dmy()
bb <- aa
str(bb)
## spec_tbl_df [4,906 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ customer_id: chr [1:4906] "Mr. Brion Stark Sr." "Ethyl Botsford" "Hosteen Jacobi" "Mr. Edw Frami" ...
##  $ order_date : chr [1:4906] "20-12-2004" "02-05-2005" "06-03-2004" "15-03-2006" ...
##  $ revenue    : num [1:4906] 32 36 116 99 76 56 108 183 30 13 ...
bb$order_date <- dmy(bb$order_date)
#
str(bb)
## spec_tbl_df [4,906 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ customer_id: chr [1:4906] "Mr. Brion Stark Sr." "Ethyl Botsford" "Hosteen Jacobi" "Mr. Edw Frami" ...
##  $ order_date : Date[1:4906], format: "2004-12-20" "2005-05-02" "2004-03-06" ...
##  $ revenue    : num [1:4906] 32 36 116 99 76 56 108 183 30 13 ...
anyNA(bb)
## [1] FALSE
summary(bb)
##  customer_id          order_date            revenue      
##  Length:4906        Min.   :2001-10-29   Min.   : 10.00  
##  Class :character   1st Qu.:2004-10-07   1st Qu.: 45.00  
##  Mode  :character   Median :2005-06-08   Median : 81.00  
##                     Mean   :2005-05-29   Mean   : 94.61  
##                     3rd Qu.:2005-11-08   3rd Qu.:137.00  
##                     Max.   :2006-12-30   Max.   :219.00
#
# #Get Analysis Date as the Next Date after the Max Date in the Data
analysis_date <- max(bb$order_date) + 1 #as_date("2006-12-31")
#
# #RFM analysis by rfm_table_order()
rfm_result <- rfm_table_order(bb, customer_id = customer_id, order_date = order_date, 
                              revenue = revenue, analysis_date = analysis_date)
# #Output is a Tibble with some other attributes
loc_src <- paste0(.z$XL, "B19-Transaction-RFM.csv")
# #Save the Result in a CSV
if(FALSE) write_csv(rfm_result$rfm, file = loc_src)

Bins

# #Bins of RFM
str(rfm_result$rfm)
## tibble [995 x 9] (S3: tbl_df/tbl/data.frame)
##  $ customer_id      : chr [1:995] "Abbey O'Reilly DVM" "Add Senger" "Aden Lesch Sr." "Admiral Senger" ...
##  $ date_most_recent : Date[1:995], format: "2006-06-09" "2006-08-13" "2006-06-20" ...
##  $ recency_days     : num [1:995] 205 140 194 132 90 84 281 246 349 619 ...
##  $ transaction_count: num [1:995] 6 3 4 5 9 9 8 4 3 4 ...
##  $ amount           : num [1:995] 472 340 405 448 843 763 699 157 363 196 ...
##  $ recency_score    : int [1:995] 3 4 3 4 5 5 3 3 2 1 ...
##  $ frequency_score  : int [1:995] 4 1 2 3 5 5 5 2 1 2 ...
##  $ monetary_score   : int [1:995] 3 2 3 3 5 5 5 1 2 1 ...
##  $ rfm_score        : num [1:995] 343 412 323 433 555 555 355 321 212 121 ...
# #Recency: Unlike the other two, its ranking looks reversed i.e. 5 is assigned to the lowest value 
# #However, 5 is assigned to the 'Most Recent'
rfm_result$rfm %>% 
  group_by(recency_score) %>% 
  summarise(MIN = min(recency_days), MAX = max(recency_days), N = n()) 
## # A tibble: 5 x 4
##   recency_score   MIN   MAX     N
##           <int> <dbl> <dbl> <int>
## 1             1   482   976   197
## 2             2   298   481   200
## 3             3   181   297   199
## 4             4   116   180   199
## 5             5     1   114   200
# #Frequency
rfm_result$rfm %>% 
  group_by(frequency_score) %>% 
  summarise(MIN = min(transaction_count), MAX = max(transaction_count), N = n()) 
## # A tibble: 5 x 4
##   frequency_score   MIN   MAX     N
##             <int> <dbl> <dbl> <int>
## 1               1     1     3   268
## 2               2     4     4   187
## 3               3     5     5   176
## 4               4     6     7   244
## 5               5     8    14   120
# #Monetary
rfm_result$rfm %>% 
  group_by(monetary_score) %>% 
  summarise(MIN = min(amount), MAX = max(amount), N = n()) 
## # A tibble: 5 x 4
##   monetary_score   MIN   MAX     N
##            <int> <dbl> <dbl> <int>
## 1              1    12   255   200
## 2              2   258   381   200
## 3              3   382   506   198
## 4              4   507   665   202
## 5              5   668  1488   195

Reading Back

# #Read CSV
jj <- read_csv(loc_src, show_col_types = FALSE) %>% 
  mutate(across(c(recency_score, frequency_score, monetary_score), as.integer))
ii <- rfm_result$rfm
#
attr(jj, "spec") <- NULL
attr(jj, "problems") <- NULL
# #Verification
all_equal(ii, jj) #TRUE
## [1] TRUE
#
attributes(ii)$class
## [1] "tbl_df"     "tbl"        "data.frame"
attributes(jj)$class
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
#
# #Modify Class Attribute i.e. Remove 1st "spec_tbl_df"
attr(jj, "class") <- attr(jj, "class")[-1]
#
all.equal(ii, jj) #TRUE
## [1] TRUE
identical(ii, jj) #TRUE
## [1] TRUE
#
# #NOTE Position of Attributes does not matter
names(attributes(ii))
## [1] "names"     "row.names" "class"
names(attributes(jj))
## [1] "row.names" "names"     "class"

Date Transformation

# #character to date using dmy()
bb <- aa
str(bb)
## spec_tbl_df [4,906 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ customer_id: chr [1:4906] "Mr. Brion Stark Sr." "Ethyl Botsford" "Hosteen Jacobi" "Mr. Edw Frami" ...
##  $ order_date : chr [1:4906] "20-12-2004" "02-05-2005" "06-03-2004" "15-03-2006" ...
##  $ revenue    : num [1:4906] 32 36 116 99 76 56 108 183 30 13 ...
#
ii <- bb
ii$order_date <- dmy(ii$order_date)
#
# #Equivalent
jj <- bb %>% mutate(order_date = dmy(order_date))
stopifnot(identical(ii, jj))
#
str(jj)
## spec_tbl_df [4,906 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ customer_id: chr [1:4906] "Mr. Brion Stark Sr." "Ethyl Botsford" "Hosteen Jacobi" "Mr. Edw Frami" ...
##  $ order_date : Date[1:4906], format: "2004-12-20" "2005-05-02" "2004-03-06" ...
##  $ revenue    : num [1:4906] 32 36 116 99 76 56 108 183 30 13 ...
anyNA(jj)
## [1] FALSE
summary(jj)
##  customer_id          order_date            revenue      
##  Length:4906        Min.   :2001-10-29   Min.   : 10.00  
##  Class :character   1st Qu.:2004-10-07   1st Qu.: 45.00  
##  Mode  :character   Median :2005-06-08   Median : 81.00  
##                     Mean   :2005-05-29   Mean   : 94.61  
##                     3rd Qu.:2005-11-08   3rd Qu.:137.00  
##                     Max.   :2006-12-30   Max.   :219.00

11.7 Develop Segments

Segment rules might look arbitrary; however, a lot of thought goes into them. This is a tedious task.

  • Question: What happens with overlap e.g. which label will be assigned to the customer with 444.
    • In this dataset, there is no overlap
    • Further, if some customer score (out of 125 possibilities) falls outside the 10 specified ranges, it is classified as ‘Others’
    • (Aside) It remains a concern “ForLater”
  • Question: Is it necessary that all the segments need to be covered
    • No, not necessary but highly recommended
    • We are doing the analysis with some plan of action, and it is preferable to put customers into distinct, proper buckets so that specific actions can be taken
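To probe the overlap question, a small base-R helper can list every rule a score triple satisfies. The bound vectors below are copied from the segment definitions used in this section; which label ultimately wins in an overlap depends on rfm_segment()'s internal rule ordering, which is not checked here:

```r
# #Bounds copied from the segment rules in this section
segment_titles <- c("First Grade", "Loyal", "Likely to be Loyal", "New Ones",
                    "Could be Promising", "Require Assistance", "Getting Less Frequent",
                    "Almost Out", "Can not Lose Them", "Do not Show Up at All")
r_low  <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1); r_high <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)
f_low  <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1); f_high <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
m_low  <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1); m_high <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
#
# #All segment rules matched by a given (R, F, M) score
matching_segments <- function(r, f, m) {
  segment_titles[r >= r_low & r <= r_high &
                 f >= f_low & f <= f_high &
                 m >= m_low & m <= m_high]
}
matching_segments(4, 4, 4) #"First Grade" "Loyal" -> 444 satisfies two rules
matching_segments(1, 1, 1) #"Do not Show Up at All"
```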
# #Developing segments
segment_titles <- c("First Grade", "Loyal", "Likely to be Loyal", "New Ones", 
                    "Could be Promising", "Require Assistance", "Getting Less Frequent",
                    "Almost Out", "Can not Lose Them", "Do not Show Up at All") 
# #Rules of Minimum and Maximum RFM for each group
r_low  <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1)
r_high <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)
f_low  <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
f_high <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
m_low  <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
m_high <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
#
stopifnot(all(vapply(list(r_low, r_high, f_low, f_high, m_low, m_high), 
                     FUN = function(x) identical(length(x), length(segment_titles)), logical(1))))
divisions <- rfm_segment(rfm_result, segment_names = segment_titles, 
                       recency_lower = r_low, recency_upper = r_high, 
                       frequency_lower = f_low, frequency_upper = f_high, 
                       monetary_lower = m_low, monetary_upper = m_high)
# #Output is a Tibble 
# #Save the Result in a CSV
loc_src <- paste0(.z$XL, "B19-Transaction-Divisions.csv")
if(FALSE) write_csv(divisions, file = loc_src)
#
# #We defined 10 segments, However only 7 (+1) of them are represented in the data 
# #and 48 customers were not captured by our classifications. These were assigned to 'Others'
divisions %>% 
  count(segment) %>% 
  mutate(PCT = round(100 * n / sum(n), 1)) %>% 
  rename(SEGMENT = segment, FREQ = n) %>% 
  arrange(desc(FREQ)) 
## # A tibble: 8 x 3
##   SEGMENT                FREQ   PCT
##   <chr>                 <int> <dbl>
## 1 Loyal                   278  27.9
## 2 Likely to be Loyal      229  23  
## 3 First Grade             158  15.9
## 4 Do not Show Up at All   111  11.2
## 5 Almost Out               86   8.6
## 6 Getting Less Frequent    50   5  
## 7 Others                   48   4.8
## 8 Require Assistance       35   3.5
#

11.8 Plots

  • (Aside)
    • The problem with these plots is that we ourselves have defined the bands for each group.
    • We can do the if-then-else on the complete data and get exact values in numbers.
    • ‘First Grade’ has a high median frequency because we have defined this label as having rank [4, 5] in recency (and combinations of the other two).
    • ‘Do not Show Up at All’ has a lower median frequency because we have defined this label as having rank [1, 2] in recency (and combinations of the other two).
  • Question: Here we have shown the median; can we do this with the mean
    • No, the median is more authentic
    • (Aside) These are ordered categorical values in [1, 5]. Thus, the median is meaningful, not the mean.
# #Histogram of Median RFM can be plotted. 
# #These are ggplot graphs so can be improved later by manually plotting
if(FALSE) {#Histograms of Median RFM for each Segment
  hh <- divisions
  rfm_plot_median_recency(hh)
  rfm_plot_median_frequency(hh)
  rfm_plot_median_monetary(hh)
}
if(FALSE) {
  hh <- rfm_result
  rfm_histograms(hh) #Histograms of RFM
  rfm_order_dist(hh) #Histograms of Customer Orders i.e. Frequency
  rfm_heatmap(hh)    #Heatmap of Monetary on axes of Recency and Frequency. Slightly useful
  rfm_bar_chart(hh)  #Bar charts with faceting of RFM
  # #Scatter Plots among Recency, Monetary, Frequency
  rfm_rm_plot(hh)
  rfm_fm_plot(hh)
  rfm_rf_plot(hh)
}

11.9 RFM on Customer

Please import the "B19-Customer.csv"

  • rfm_table_customer()
    • Use Customer level data
  • Question: If we are using the recency days given in the data, then what is the use of providing the analysis date
    • “”
    • (Aside) I am unclear about the answer to this question given at “2021-11-14 18:08” “ForLater”
# #character to date using dmy()
bb <- aa <- xxB19Customer
str(bb)
## spec_tbl_df [39,999 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ customer_id      : num [1:39999] 22086 2290 26377 24650 12883 ...
##  $ revenue          : num [1:39999] 777 1555 336 1189 1229 ...
##  $ most_recent_visit: chr [1:39999] "14-05-2006" "08-09-2006" "19-11-2006" "29-10-2006" ...
##  $ number_of_orders : num [1:39999] 9 16 5 12 12 11 17 11 9 10 ...
##  $ recency_days     : num [1:39999] 232 115 43 64 23 72 112 142 43 131 ...
bb$most_recent_visit <- dmy(bb$most_recent_visit)
#
# #Get Analysis Date as the Next Date after the Max Date in the Data
analysis_date <- max(bb$most_recent_visit) + 1 #as_date("2006-12-31")
#
# #RFM analysis by rfm_table_customer()
rfm_customer <- rfm_table_customer(bb, customer_id = customer_id, n_transactions = number_of_orders,
              recency_days = recency_days, total_revenue = revenue, analysis_date = analysis_date)
# #Output is a Tibble with some other attributes
# #Save the Result in a CSV
loc_src <- paste0(.z$XL, "B19-Customer-RFM.csv")
if(FALSE) write_csv(rfm_customer$rfm, file = loc_src)

11.10 RFM OnlineRetail

Please import the "B19-OnlineRetail.csv"

  • About: [541909, 8]
    • However, the data has NAs in CustomerID. We cannot impute CustomerID; those rows should be eliminated
    • Rows with UnitPrice 0, or Quantity 0 or negative (returns), should be removed
    • InvoiceDate
      • (Aside) Caution: The lectures on Nov-14 and Nov-21 converted the dates assuming the data is in “dd-mm-yyyy” format. However, the data actually is “mm-dd-yyyy”
  • Question: If we eliminate returns, should we not also remove the related original transaction, since we are keeping that transaction's data as actual revenue
    • We can keep it for the purpose of this analysis. Some ‘unforeseen circumstance’ led to the return; however, the actual transaction did happen. The customer did buy the product.
    • Further, currently we are interested in segregating the customers into different labels /types. We are not analysing profit or growth; we are analysing customer purchase patterns in terms of RFM only.
    • (Argument) But then we are considering his monetary contribution on the higher side. If he has done a single transaction of 1000 dollars but later returned the product, the customer actually did not contribute anything to the company. However, we will give him a higher ranking than another customer who purchased an item of 500 dollars.
bb <- aa <- xxB19Retail
#
# #NOTE dates are in mm-dd-yyyy format
bb$InvoiceDate[5000:5010] 
##  [1] "12-02-2010" "12-02-2010" "12-02-2010" "12-02-2010" "12-02-2010" "12-02-2010" "12-02-2010"
##  [8] "12-02-2010" "12-02-2010" "12-02-2010" "12-02-2010"
bb$InvoiceDate <- mdy(bb$InvoiceDate)
#
# #Which Columns have NA
which(vapply(bb, anyNA, logical(1)))
## Description  CustomerID 
##           3           7
#
# #Remove NA | Remove Unit Price with 0 | Quantity 0 or Negative i.e. Returns | Dropped Columns |
# #Calculate Revenue
ii <- bb %>% 
  drop_na(CustomerID) %>% 
  filter(UnitPrice > 0 & Quantity > 0) %>% 
  select(-c(1:3, 8)) %>% 
  mutate(Revenue = UnitPrice * Quantity)
#
summary(ii)
##     Quantity         InvoiceDate           UnitPrice          CustomerID       Revenue         
##  Min.   :    1.00   Min.   :2010-12-01   Min.   :   0.001   Min.   :12346   Min.   :     0.00  
##  1st Qu.:    2.00   1st Qu.:2011-04-07   1st Qu.:   1.250   1st Qu.:13969   1st Qu.:     4.68  
##  Median :    6.00   Median :2011-07-31   Median :   1.950   Median :15159   Median :    11.80  
##  Mean   :   12.99   Mean   :2011-07-10   Mean   :   3.116   Mean   :15294   Mean   :    22.40  
##  3rd Qu.:   12.00   3rd Qu.:2011-10-20   3rd Qu.:   3.750   3rd Qu.:16795   3rd Qu.:    19.80  
##  Max.   :80995.00   Max.   :2011-12-09   Max.   :8142.750   Max.   :18287   Max.   :168469.60
# #Developing segments
segment_titles <- c("First Grade", "Loyal", "Likely to be Loyal", "New Ones", 
                    "Could be Promising", "Require Assistance", "Getting Less Frequent",
                    "Almost Out", "Can not Lose Them", "Do not Show Up at All") 
# #Rules of Minimum and Maximum RFM for each group
r_low  <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1)
r_high <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)
f_low  <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
f_high <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
m_low  <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
m_high <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
#
stopifnot(all(vapply(list(r_low, r_high, f_low, f_high, m_low, m_high), 
                     FUN = function(x) identical(length(x), length(segment_titles)), logical(1))))
# #Get Analysis Date as the Next Date after the Max Date in the Data
analysis_date <- max(ii$InvoiceDate) + 1 #as_date("2011-12-10")
rfm_ii <- rfm_table_order(ii, customer_id = CustomerID, order_date = InvoiceDate, 
                              revenue = Revenue, analysis_date = analysis_date)
div_ii <- rfm_segment(rfm_ii, segment_names = segment_titles, 
                       recency_lower = r_low, recency_upper = r_high, 
                       frequency_lower = f_low, frequency_upper = f_high, 
                       monetary_lower = m_low, monetary_upper = m_high)
# #Sorted Count of Segments
div_ii %>% 
  count(segment) %>% 
  mutate(PCT = round(100 *n / sum(n), 1)) %>% 
  rename(SEGMENT = segment, FREQ = n) %>% 
  arrange(desc(FREQ)) 
## # A tibble: 8 x 3
##   SEGMENT                FREQ   PCT
##   <chr>                 <int> <dbl>
## 1 Loyal                  1163  26.8
## 2 First Grade             920  21.2
## 3 Likely to be Loyal      741  17.1
## 4 Almost Out              439  10.1
## 5 Do not Show Up at All   404   9.3
## 6 Others                  287   6.6
## 7 Getting Less Frequent   214   4.9
## 8 Require Assistance      170   3.9

Validation


12 K-means (B20, Nov-21)

12.1 Overview

  • “K-means Cluster Analysis”
    • “ForLater” - The PPT shared in the class was corrupted, need a working file.
    • Refer Book Merged Here

12.2 Packages

if(FALSE){# #WARNING: Installation may take some time.
  install.packages("factoextra", dependencies = TRUE)
}

12.3 Clustering

46.1 Clustering refers to the grouping of records, observations, or cases into classes of similar objects. Clustering differs from classification in that there is no target variable for clustering.

46.2 A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters.

  • Clustering is an unsupervised learning technique.

12.4 k-means Clustering

  • Question: Is it same as k-nearest neighbour
    • No, k-NN is a classification technique and is supervised; k-means is a clustering method and is unsupervised.
  • Clustering algorithms seek to construct clusters of records such that the between-cluster variation is large compared to the within-cluster variation.
  • The variables need to be scaled before the Euclidean distance can be calculated to identify clusters
  • Outliers are also a problem. Normalisation does not help with outliers
  • The k-means algorithm can be applied only when the mean of a cluster is defined.
    • Thus, the limitation is that we cannot apply k-means to categorical variables

46.3 Euclidean distance between records is given by equation, \(d_{\text{Euclidean}}(x,y) = \sqrt{\sum_i{\left(x_i - y_i\right)^2}}\), where \(x = \{x_1, x_2, \ldots, x_m\}\) and \(y = \{y_1, y_2, \ldots, y_m\}\) represent the \({m}\) attribute values of two records.

  • Question: Do we need to ensure that the number of data points in each cluster remains the same
    • No
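The formula can be checked directly in base R against dist(); the two records below are illustrative values with m = 2 attributes:

```r
# #Two records with m = 2 attributes (illustrative values)
x <- c(3, 9)
y <- c(7, 12)
#
# #The formula: square root of the summed squared differences
d_manual <- sqrt(sum((x - y)^2))
# #Base R's dist() computes the same Euclidean distance
d_built <- as.numeric(dist(rbind(x, y)))
#
d_manual #5
```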

12.5 Algorithm

  1. Ask the user how many clusters \({k}\) the data set should be partitioned into.
    • Ex: \(k = 3\)
  2. Randomly assign \({k}\) records to be the initial cluster center locations.
  3. For each record, find the nearest cluster center.
    • Thus, in a sense, each cluster center “owns” a subset of the records, thereby representing a partition of the data set.
    • We therefore have \({k}\) clusters, \(\{C_1, C_2, \ldots, C_k\}\)
      • Ex: \({C_1 = (3, 9), C_2 = (7, 12), C_3 = (6, 18)}\)
    • The “nearest” criterion is usually Euclidean distance
  4. For each of the \({k}\) clusters, find the cluster centroid, and update the location of each cluster center to the new value of the centroid.
    • Obviously, the centroid need not be an actual point within the data, just as the mean of a set of values need not itself exist within that set.
  5. Repeat steps 3-4 until convergence or termination.
    • The algorithm terminates when the centroids no longer change.
      • In other words, the algorithm terminates when for all clusters \(\{C_1, C_2, \ldots, C_k\}\), all the records “owned” by each cluster center remain in that cluster.
    • Alternatively, the algorithm may terminate when some convergence criterion is met, such as no significant shrinkage in the mean squared error \(\text{MSE} = \frac{\text{SSE}}{N - k}\), where SSE represents the sum of squares error.
  • Question: Is the number of iterations a function of the initial random assignment
    • Yes
  • Question: Would we all, with different initial random assignments, reach the same cluster solution
    • Yes
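Steps 2-5 can be traced by hand on a toy 2-D dataset (hypothetical values); this is a bare-bones sketch of one full k-means run, not the kmeans() implementation:

```r
# #Toy 2-D data with two obvious groups (hypothetical values)
pts <- rbind(c(0, 0), c(0, 1), c(1, 0),
             c(10, 10), c(10, 11), c(11, 10))
k <- 2
centers <- pts[c(1, 4), ] # #Step 2: initial cluster centers (here fixed, not random)
repeat {
  # #Step 3: assign each record to its nearest center (squared Euclidean distance)
  cl <- apply(pts, 1, function(p) which.min(colSums((t(centers) - p)^2)))
  # #Step 4: move each center to the centroid (mean) of its members
  new_centers <- t(sapply(1:k, function(j) colMeans(pts[cl == j, , drop = FALSE])))
  # #Step 5: stop when the centroids no longer change
  if (isTRUE(all.equal(new_centers, centers))) break
  centers <- new_centers
}
cl #1 1 1 2 2 2
```

With well-separated groups and distinct starting points the loop settles in a couple of iterations; a production implementation would also guard against empty clusters.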

12.6 How many k

  • What is the number that would be both practically and statistically feasible
    • ‘k’ should be the ‘best guess’ on the number of clusters present in the given data.
      • However, we may not have any idea about the possible number of clusters for high-dimensional data or for data that has not been scatterplotted
      • There is NO principled way to know what the value of ‘k’ ought to be.
      • We may try successive values of ‘k’ starting with 2.
    • Within-Cluster Sum of Squares (WSS) represents within-cluster variation i.e. within-cluster homogeneity.
      • we expect a low value of WSS (or MSE or SSE)
    • Between-Cluster Sum of Squares (BSS) represents between-cluster variation i.e. between-cluster heterogeneity
      • we expect a high value of BSS (or MSB or SSB)
    • A plot of SSE vs. k looks like a scree plot, and the elbow method can be used to identify the optimal number of clusters k
  • The iterative process is stopped when two consecutive ‘k’ values produce more or less identical results in terms of within- and between-cluster variances
    • However, it is possible that this ‘k’ value represents a local minimum and not the global minimum.
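The search over successive k can be sketched by computing tot.withinss for each candidate with kmeans(); the built-in iris measurements (scaled) stand in here for an arbitrary numeric dataset, and plotting wss against k gives the elbow:

```r
set.seed(3)
# #Any scaled numeric data; the built-in iris measurements are a stand-in
dat <- scale(iris[, 1:4])
#
# #Total within-cluster sum of squares for k = 1..6 (nstart restarts reduce local-minimum risk)
k_values <- 1:6
wss <- sapply(k_values, function(k) kmeans(dat, centers = k, nstart = 10)$tot.withinss)
#
# #WSS always decreases as k grows; the elbow marks diminishing returns
if(FALSE) plot(k_values, wss, type = "b", xlab = "k", ylab = "Total WSS")
```

For k = 1 the within-cluster SS equals the total SS of the scaled data, (n − 1) × p = 149 × 4 = 596 here.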

12.7 Data Movies

Please import the "B20-movie.csv"

  • About: [291, 6]
    • Each Row represents a unique customer and the average scores they have given to different movie genres
    • Normalisation
      • (Aside) Normalisation has been done here. However, this data is average ratings on a common 1-100 scale, so (probably) it does not actually need normalisation. “ForLater”

EDA

bb <- aa <- xxB20Movies
# #Drop ID | Scale | 
xw <- aa %>% select(-1) 
zw <- xw %>% mutate(across(everything(), ~ as.vector(scale(.))))
#
summary(xw)
##      Horror           Romcom          Action           Comedy          Fantasy      
##  Min.   :  0.00   Min.   : 0.00   Min.   : 24.60   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 40.00   1st Qu.:19.90   1st Qu.: 58.75   1st Qu.: 38.50   1st Qu.: 28.95  
##  Median : 62.80   Median :29.70   Median : 70.50   Median : 60.00   Median : 41.20  
##  Mean   : 58.57   Mean   :31.25   Mean   : 68.84   Mean   : 56.52   Mean   : 45.61  
##  3rd Qu.: 78.25   3rd Qu.:41.65   3rd Qu.: 80.55   3rd Qu.: 73.45   3rd Qu.: 59.85  
##  Max.   :100.00   Max.   :81.30   Max.   :100.00   Max.   :100.00   Max.   :100.00

12.8 k-means

  • Because I chose a different seed than the professor, my algorithm converged through different iterations but to the same clusters. However, cluster 1 and cluster 2 got interchanged in the process.
    • The lecture's cluster 1 is cluster 2 here (size 218), and vice-versa.
# #Fix Seed
set.seed(3)
# #Cluster analysis with different k = {2, 3, 4}
k2_zw <- kmeans(zw, centers = 2)
k3_zw <- kmeans(zw, centers = 3)
k4_zw <- kmeans(zw, centers = 4)
#
names(k2_zw)
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"   
## [7] "size"         "iter"         "ifault"
#
# #Two Clusters
ii <- k2_zw
# #within-cluster sum of squares (Preferred lower value i.e. Homogeneity within cluster)
ii$withinss  
## [1] 159.6976 605.7610
identical(ii$tot.withinss, sum(ii$withinss))
## [1] TRUE
# #between-cluster sum of squares
ii$betweenss 
## [1] 684.5413
# #The total sum of squares
ii$totss 
## [1] 1450
paste0("Between /Total = ",  round(100 * ii$betweenss / ii$totss, 2), "%")
## [1] "Between /Total = 47.21%"
#
# #Members within Clusters
ii$size
## [1]  73 218
#
# #Matrix of cluster centres
round(ii$centers, 3)
##   Horror Romcom Action Comedy Fantasy
## 1 -1.383  1.200 -1.170  0.835   1.267
## 2  0.463 -0.402  0.392 -0.279  -0.424
#
# #Cluster Membership of each point
str(ii$cluster)
##  int [1:291] 2 2 2 2 2 2 2 2 2 2 ...
#
# #Save cluster membership of each point back into the dataset
res_movies <- cbind(xw, 
  list(k2 = k2_zw$cluster, k3 = k3_zw$cluster, k4 = k4_zw$cluster)) %>% as_tibble()
  • Explanation:
    • In normalised data, average is 0
      • Thus, positive values are above average, negative values are below average
  • Two Clusters: Cluster 2 (Size 218) and Cluster 1 (Size 73)
    • Cluster 2 gave Horror & Action movies above average ratings (Favourable)
    • Cluster 2 gave lower than average ratings for Romcom, Comedy, Fantasy (Unfavourable)
    • The behaviour of Cluster 1 is completely opposite to Cluster 2
    • However, we cannot make a conclusion here because Between /Total is 47%
      • i.e. too much heterogeneity remains within the clusters
  • Three Clusters: of Sizes 72, 105, 114 with Between /Total = 62% (improved i.e. within reduced)
    • We can analyse these clusters similar to above
  • Four Clusters: of Sizes 73, 51, 69, 98 with Between /Total = 64%
    • improved i.e. within reduced but not by much
    • NOTE: Here my cluster sizes are NOT matching the lecture, and the Between /Total is similar but not exactly the same.
      • There are 2 reasons for that :
        • I fixed the seed once and then ran the commands i.e. (Seed | k=2 | k=3 | k=4). The professor fixed the seed each time i.e. (Seed | k=2 | Seed | k=3 | Seed | k=4)
        • I used a different seed. The effect of a different starting seed is more pronounced as ‘k’ increases
# #Three Clusters
ii <- k3_zw
ii$size
## [1]  72 105 114
paste0("Between /Total = ",  round(100 * ii$betweenss / ii$totss, 2), "%")
## [1] "Between /Total = 61.81%"
round(ii$centers, 3)
##   Horror Romcom Action Comedy Fantasy
## 1 -1.392  1.210 -1.177  0.816   1.296
## 2  0.906 -0.368  0.432 -1.108  -0.036
## 3  0.044 -0.425  0.346  0.505  -0.785
# #Four Clusters
ii <- k4_zw
ii$size
## [1] 73 51 69 98
paste0("Between /Total = ",  round(100 * ii$betweenss / ii$totss, 2), "%")
## [1] "Between /Total = 64.73%"
round(ii$centers, 3)
##   Horror Romcom Action Comedy Fantasy
## 1  0.068 -0.899  0.362  0.530  -0.817
## 2  0.065  0.355  0.364  0.327  -0.572
## 3 -1.434  1.232 -1.214  0.848   1.325
## 4  0.925 -0.383  0.396 -1.162  -0.027

12.9 Elbow Plot of WSS

  • fviz_nbclust() :
    • Note that it performs the clustering on the original data. It does not take the already created clusters as input.
hh <- zw
cap_hh <- "B20P01"
ttl_hh <- "Movie: Elbow Curve (WSS)"
loc_png <- paste0(.z$PX, "B20P01", "-Movie-Elbow-Wss", ".png")
#
# #factoextra::fviz_nbclust() generates ggplot
# #method = "wss" (for total within sum of square)
B20P01 <- fviz_nbclust(hh, FUNcluster = kmeans, method = "wss") +
  labs(caption = cap_hh, title = ttl_hh)
(B20P01 B20P03) Movie: Elbow Curve (WSS) in FactoExtra and Base R

Figure 12.1 (B20P01 B20P03) Movie: Elbow Curve (WSS) in FactoExtra and Base R

12.10 Plot Clusters

(B20P02) Movie: Genres with k=3

Figure 12.2 (B20P02) Movie: Genres with k=3

Validation


13 Hierarchical Clustering (B21, Nov-28)

13.1 Overview

xxB20Movies <- f_getRDS(xxB20Movies)
bb <- aa <- xxB20Movies
# #Drop ID | Scale | 
xw <- aa %>% select(-1) 
zw <- xw %>% mutate(across(everything(), ~ as.vector(scale(.))))
#
summary(xw)
##      Horror           Romcom          Action           Comedy          Fantasy      
##  Min.   :  0.00   Min.   : 0.00   Min.   : 24.60   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 40.00   1st Qu.:19.90   1st Qu.: 58.75   1st Qu.: 38.50   1st Qu.: 28.95  
##  Median : 62.80   Median :29.70   Median : 70.50   Median : 60.00   Median : 41.20  
##  Mean   : 58.57   Mean   :31.25   Mean   : 68.84   Mean   : 56.52   Mean   : 45.61  
##  3rd Qu.: 78.25   3rd Qu.:41.65   3rd Qu.: 80.55   3rd Qu.: 73.45   3rd Qu.: 59.85  
##  Max.   :100.00   Max.   :81.30   Max.   :100.00   Max.   :100.00   Max.   :100.00

13.2 Elbow Plot of Silhouette

47.3 The silhouette is a characteristic of each data value. For each data value i, \(\text{Silhouette}_i = s_i = \frac{b_i - a_i}{\text{max}(b_i, a_i)} \to s_i \in [-1, 1]\), where \(a_i\) is the distance between the data value and its cluster center (cohesion), and \(b_i\) is the distance between the data value and the next closest cluster center (separation).

  • Refer Silhouette
    • Range [-1, 1]
    • A good solution has Silhouette value approaching 1
  • Question: Is a value of 0.2 (positive but near zero) a good value
    • It is comparative i.e. at what k you get the maximum silhouette value
    • Further, it might be taken as an indication that the dataset is not ready for clustering
    • (Aside) a value close to zero is considered a weak assignment
  • Question: If the value is 0.2, can we claim that no clustering is required
    • No; sometimes the data has inherent heterogeneity. If the value is negative, that implies bad clustering; however, a small positive value does not imply anything.
  • Question: WSS recommended 3 and silhouette recommends 2; now what
    • Use your own judgement
    • (Aside) "All models are wrong, but some are useful." - George E. P. Box
  • Question: What is the purpose of finding the ‘Optimal Number of Clusters’ when we are using our own judgement anyway, e.g. in the Crime data, silhouette recommends 2 but we are leaning towards 3
    • We use k-means clustering when we already have some idea about the number of clusters. However, the data might show something different; it is more about validating our assumption.
    • However, if we have no idea about the number of clusters, we should NOT use k-means clustering; Hierarchical Clustering should be used instead.
    • (Aside) "An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." - John W. Tukey
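As a sketch of how the average silhouette width behaves across k, the computation behind fviz_nbclust() can be reproduced with cluster::silhouette() on a small synthetic dataset (two well-separated groups; the movie data itself is not reused here):

```r
# #Average silhouette width for k = 2..4 on synthetic data
# #('cluster' ships with standard R installations)
library(cluster)

set.seed(1)
zz <- scale(rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                  matrix(rnorm(40, mean = 4), ncol = 2)))
dd <- dist(zz)
avg_sil <- sapply(2:4, function(k) {
  cl <- kmeans(zz, centers = k, nstart = 10)$cluster
  mean(silhouette(cl, dd)[, "sil_width"])
})
names(avg_sil) <- 2:4
round(avg_sil, 3)
```

The k with the largest average width is the silhouette-preferred k; values near zero for every k would be the "weak assignment" situation discussed above.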
hh <- zw
cap_hh <- "B21P01"
ttl_hh <- "Movie: Elbow Curve (Silhouette)"
#
# #method = "silhouette" (for average silhouette width)
B21P01 <- fviz_nbclust(hh, FUNcluster = kmeans, method = "silhouette") +
  labs(caption = cap_hh, title = ttl_hh)

Figure 13.1 (B21P01) Movie: Elbow Curve of k (Silhouette)

13.3 Data Crime

Please import the "B21-state-crime.csv"

aa <- xxB21Crime
# #Only Year 2019 | Exclude USA Total | Only Rates Variables NOT Total | Scale | 
xw <- aa %>% 
  filter(Year == "2019", State != "United States") %>% 
  select(Data.Population, starts_with("Data.Rates") & !ends_with("All"))
#
# #Rename Columns for ease of use
ii <- names(xw)
ii <- str_replace(ii, pattern = paste0(c("Data.Rates.", "Data."), collapse = "|"), "")
ii <- str_replace_all(ii, c("Violent." = "v_", "Property." = "p_"))
names(xw) <- ii
#
zw <- xw %>% mutate(across(everything(), ~ as.vector(scale(.))))
#
dim(xw)
## [1] 51  8
summary(xw)
##    Population         p_Burglary      p_Larceny         p_Motor        v_Assault    
##  Min.   :  578759   Min.   :126.3   Min.   : 911.8   Min.   : 47.8   Min.   : 61.3  
##  1st Qu.: 1789606   1st Qu.:243.2   1st Qu.:1193.5   1st Qu.:141.6   1st Qu.:165.9  
##  Median : 4467673   Median :328.7   Median :1555.7   Median :203.8   Median :246.3  
##  Mean   : 6436069   Mean   :345.6   Mean   :1580.4   Mean   :215.2   Mean   :257.2  
##  3rd Qu.: 7446805   3rd Qu.:412.4   3rd Qu.:1846.9   3rd Qu.:274.1   3rd Qu.:309.9  
##  Max.   :39512223   Max.   :696.8   Max.   :3775.4   Max.   :427.2   Max.   :650.5  
##     v_Murder          v_Rape         v_Robbery     
##  Min.   : 1.500   Min.   : 17.20   Min.   :  8.70  
##  1st Qu.: 2.550   1st Qu.: 36.85   1st Qu.: 41.10  
##  Median : 4.600   Median : 44.60   Median : 63.60  
##  Mean   : 5.145   Mean   : 47.66   Mean   : 68.89  
##  3rd Qu.: 6.400   3rd Qu.: 55.00   3rd Qu.: 80.95  
##  Max.   :23.500   Max.   :148.70   Max.   :384.40

Figure 13.2 (B21P02 B21P03) Crime: Elbow Curve of k Silhouette and WSS

# #Cluster analysis with different k = {3, 4}
set.seed(3)
k3_zw <- kmeans(zw, centers = 3)
k4_zw <- kmeans(zw, centers = 4)
# #Save cluster membership of each point back into the dataset
res_crime <- cbind(xw, list(k3 = k3_zw$cluster, k4 = k4_zw$cluster)) %>% as_tibble()
#
# #Three Clusters
ii <- k3_zw
ii$size
## [1]  1 21 29
paste0("Between /Total = ",  round(100 * ii$betweenss / ii$totss, 2), "%")
## [1] "Between /Total = 48.54%"
round(ii$centers, 3)
##   Population p_Burglary p_Larceny p_Motor v_Assault v_Murder v_Rape v_Robbery
## 1     -0.779     -0.597     4.574   1.185     2.612    4.963  0.063     5.708
## 2      0.176      0.934     0.599   0.864     0.556    0.356  0.291     0.200
## 3     -0.100     -0.656    -0.592  -0.666    -0.493   -0.429 -0.213    -0.342
#
# #Four Clusters
ii <- k4_zw
ii$size
## [1] 15 13 11 12
paste0("Between /Total = ",  round(100 * ii$betweenss / ii$totss, 2), "%")
## [1] "Between /Total = 48.14%"
round(ii$centers, 3)
##   Population p_Burglary p_Larceny p_Motor v_Assault v_Murder v_Rape v_Robbery
## 1      0.733      0.274     0.255   0.377    -0.203    0.073 -0.273     0.287
## 2     -0.484     -0.303    -0.301  -0.184    -0.237   -0.526  0.222    -0.573
## 3     -0.365      1.240     1.118   1.026     1.464    1.200  0.773     0.775
## 4     -0.058     -1.151    -1.017  -1.213    -0.831   -0.621 -0.609    -0.448
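The "Between /Total" ratio printed above is the share of total variance captured by the cluster centers. As a sanity check on the built-in USArrests data (a stand-in for the course crime file), every kmeans() fit decomposes the total sum of squares exactly:

```r
# #Sanity check: kmeans() always satisfies totss = betweenss + tot.withinss
set.seed(3)
kk <- kmeans(scale(USArrests), centers = 3, nstart = 10)
stopifnot(isTRUE(all.equal(kk$totss, kk$betweenss + kk$tot.withinss)))
round(100 * kk$betweenss / kk$totss, 2)  # #the "Between /Total" percentage
```

A higher percentage means the chosen centers explain more of the spread, but it always increases with k, so it should not be used alone to pick k.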

13.4 Hierarchical Clustering

46.4 In hierarchical clustering, a treelike cluster structure (dendrogram) is created through recursive partitioning (divisive methods) or combining (agglomerative) of existing clusters.

46.5 Agglomerative clustering methods initialize each observation to be a tiny cluster of its own. Then, in succeeding steps, the two closest clusters are aggregated into a new combined cluster. In this way, the number of clusters in the data set is reduced by one at each step. Eventually, all records are combined into a single huge cluster. Most computer programs that apply hierarchical clustering use agglomerative methods.

46.6 Divisive clustering methods begin with all the records in one big cluster, with the most dissimilar records being split off recursively, into a separate cluster, until each record represents its own cluster.

  • Ex: Flipkart
    • We would do Agglomerative Clustering, i.e. start with 1000 customers and end up with, say, 10 clusters, rather than starting with 1 cluster
  • Question: Are not B2C companies trying for hyper-localised, individual-level targeting of customers
    • No, they are creating a higher number of groups based on wider characteristics. No one is profiling a single customer; rather, the groups are now highly specific yet contain a high number of customers.
  • Hierarchical
    • A Distance Matrix is used to decide which clusters to merge or split
    • At least quadratic in the number of data points
    • Not usable for large datasets
  • Notes on Divisive (because Agglomerative will be the main focus)
    • Monothetic or Polythetic methods
    • Intercluster distance can be measured
    • Computationally intensive
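A minimal divisive example, using cluster::diana() on the built-in USArrests data (an assumption; the course data is not reused here):

```r
# #Divisive clustering with DIANA from the 'cluster' package (ships with R)
library(cluster)

zz <- scale(USArrests)
dv <- diana(zz)                       # #starts from one big cluster and splits
grp <- cutree(as.hclust(dv), k = 3)   # #cut the divisive tree at k = 3
table(grp)
```

Converting with as.hclust() lets the usual cutree() / plot() tooling work on the divisive tree as well.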

13.5 Linkages

46.7 Single linkage, the nearest-neighbor approach, is based on the minimum distance between any record in cluster A and any record in cluster B. Cluster similarity is based on the similarity of the most similar members from each cluster. It tends to form long, slender clusters, which may sometimes lead to heterogeneous records being clustered together.

46.8 Complete linkage, the farthest-neighbor approach, is based on the maximum distance between any record in cluster A and any record in cluster B. Cluster similarity is based on the similarity of the most dissimilar members from each cluster. It tends to form more compact, spherelike clusters.

46.9 Average linkage is designed to reduce the dependence of the cluster-linkage criterion on extreme values, such as the most similar or dissimilar records. The criterion is the average distance of all the records in cluster A from all the records in cluster B. The resulting clusters tend to have approximately equal within-cluster variability. In general, average linkage leads to clusters more similar in shape to complete linkage than does single linkage.

  • How the distance matrix is calculated is the main difference between these Methods
    • Some more methods are Centroid Method, Ward Method etc.
  • Single Linkage
    • Positives
      • Can handle non-elliptical shapes
    • Negatives
      • Sensitive to Noise and Outliers
      • It produces long, elongated clusters
  • Complete Linkage
    • Positives
      • More balanced clusters (with equal diameters)
      • Less susceptible to noise
    • Negatives
      • Tends to break large clusters
      • All clusters tend to have the same diameter - small clusters are merged with larger ones
  • Average Linkage
    • Positives
      • Less susceptible to noise and outliers
    • Negatives
      • Biased towards globular clusters
  • Ward
    • Similar to average and centroid
    • Less susceptible to noise and outliers
    • Biased towards globular clusters
    • Hierarchical analogue of k-means, i.e. can be used to initialise k-means
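The linkage behaviours above can be compared directly, since hclust() accepts the same distance matrix with different method arguments (built-in USArrests stands in for the course data):

```r
# #Same distance matrix, three linkage methods
dd <- dist(scale(USArrests))
hc <- lapply(c(single = "single", complete = "complete", ward = "ward.D2"),
             function(m) hclust(dd, method = m))
# #Cluster sizes at k = 3: single linkage often "chains" into one large
# #cluster, while complete and Ward tend to give more balanced sizes
sapply(hc, function(h) sort(table(cutree(h, k = 3)), decreasing = TRUE))
```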

Validation


14 PCA (B22, Dec-05)

14.1 Overview

xxB20Movies <- f_getRDS(xxB20Movies)
bb <- aa <- xxB20Movies
# #Drop ID | Scale | 
xw <- aa %>% select(-1) 
zw <- xw %>% mutate(across(everything(), ~ as.vector(scale(.))))
#
summary(xw)
##      Horror           Romcom          Action           Comedy          Fantasy      
##  Min.   :  0.00   Min.   : 0.00   Min.   : 24.60   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 40.00   1st Qu.:19.90   1st Qu.: 58.75   1st Qu.: 38.50   1st Qu.: 28.95  
##  Median : 62.80   Median :29.70   Median : 70.50   Median : 60.00   Median : 41.20  
##  Mean   : 58.57   Mean   :31.25   Mean   : 68.84   Mean   : 56.52   Mean   : 45.61  
##  3rd Qu.: 78.25   3rd Qu.:41.65   3rd Qu.: 80.55   3rd Qu.: 73.45   3rd Qu.: 59.85  
##  Max.   :100.00   Max.   :81.30   Max.   :100.00   Max.   :100.00   Max.   :100.00

14.2 Packages

if(FALSE){# #WARNING: Installation may take some time.
  install.packages("cluster", dependencies = TRUE)
  install.packages("arules", dependencies = TRUE)
  install.packages("arulesViz", dependencies = TRUE)
}

14.3 Hierarchical

  • Agglomerative Clustering is also known as ‘Bottom-up’ and Divisive Clustering is also known as ‘Top-down’

  • Up to 16:18: the mathematical formulation of the linkages, which is NOT included here.

  • Question: Nothing much can be inferred from Cluster 2, even though it has a high number of data points (136). Does it call for a further split

    • In other words, Action & Comedy do not show a strong preference and the others are negative. Thus there is a high possibility that the cluster is heterogeneous in nature.
    • Probably Yes
    • So we looked at k=4 also, but the average silhouette value got reduced in Figure 14.2
  • Question: When k=2 the average silhouette value improved, so should we accept this as the optimal number of clusters

    • The value has improved, but one of these two clusters has a large size (N = 233), which is not that good a solution in terms of clustering
  • Question: Is it ok to mix and match k-means and hierarchical clustering

    • It does happen, but we should not mix them. In one case we assume we have some idea about the number of clusters; in the other we place no such assumption.
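The "hierarchical analogue of k-means" point from the Ward linkage notes can be sketched as follows: cut a Ward tree at k and feed the group means to kmeans() as starting centers (built-in USArrests is used as a stand-in):

```r
# #Using Ward hierarchical clusters to initialise k-means
zz <- scale(USArrests)
hc <- hclust(dist(zz), method = "ward.D2")
grp <- cutree(hc, k = 3)
# #Per-cluster column means (a 3 x 4 matrix) as starting centers
centers0 <- apply(zz, 2, function(col) tapply(col, grp, mean))
km <- kmeans(zz, centers = centers0)
# #Cross-tabulate memberships: the two solutions usually agree closely
table(hierarchical = grp, kmeans = km$cluster)
```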
str(zw)
## tibble [291 x 5] (S3: tbl_df/tbl/data.frame)
##  $ Horror : num [1:291] 0.572 0.97 0.469 1.663 1.043 ...
##  $ Romcom : num [1:291] -0.0792 0.8276 0.751 -0.6032 -1.8397 ...
##  $ Action : num [1:291] -0.0152 0.4812 -0.2351 0.5692 -0.0466 ...
##  $ Comedy : num [1:291] -0.699 -1.728 -0.125 -1.374 -0.297 ...
##  $ Fantasy: num [1:291] 0.559 1.004 -0.355 -0.241 -0.255 ...
#
# #Create distance matrix
dist_zw <- dist(zw)
#
hclust_com_zw <- hclust(dist_zw, method = "complete")
hclust_avg_zw <- hclust(dist_zw, method = "average")
hclust_sng_zw <- hclust(dist_zw, method = "single")
#
# #Cut Tree by Cluster membership
k2_com_zw <- cutree(hclust_com_zw, 2)
k3_com_zw <- cutree(hclust_com_zw, 3)
k4_com_zw <- cutree(hclust_com_zw, 4)
#
table(k3_com_zw)
## k3_com_zw
##   1   2   3 
##  97 136  58
str(k3_com_zw)
##  int [1:291] 1 1 2 1 1 2 1 1 1 1 ...
# #Save cluster membership of each point back into the dataset
res_movies <- cbind(xw, list(k3 = k3_com_zw, k4 = k4_com_zw)) %>% as_tibble()
#
# #Cluster Mean
if(FALSE) aggregate(zw, by = list(k3_com_zw), FUN = function(x) round(mean(x), 3))
# #Equivalent
res_movies %>% select(-k4) %>% group_by(k3) %>% summarise(N = n(), across(everything(), mean))
## # A tibble: 3 x 7
##      k3     N Horror Romcom Action Comedy Fantasy
##   <int> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
## 1     1    97   80.2   23.8   77.0   31.2    45.7
## 2     2   136   58.0   26.7   72.4   66.1    32.4
## 3     3    58   23.7   54.5   47.0   76.5    76.5

14.4 Dendrogram


Figure 14.1 (B22P01) Movie: Dendrogram (Complete Linkage) with k =3 G, 4 B, 6 R
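A dendrogram like Figure 14.1, with coloured boxes for different k, can be sketched in base graphics (USArrests stands in for the movie data):

```r
# #Dendrogram with cluster boxes drawn by rect.hclust()
hc <- hclust(dist(scale(USArrests)), method = "complete")
plot(hc, hang = -1, cex = 0.6, main = "Dendrogram (Complete Linkage)")
rect.hclust(hc, k = 3, border = "green")  # #boxes for k = 3
rect.hclust(hc, k = 4, border = "blue")   # #boxes for k = 4
```

Cutting the tree at a height line is equivalent to choosing k: each box encloses the leaves of one cluster at that cut.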

14.5 Silhouette


Figure 14.2 (B22P03 B22P04) Movie: Silhouette with Distance for k={3, 4}


Figure 14.3 (B22P03 B22P02) Movie: Silhouette with Distance for k={3, 2}

14.6 Association Rule Mining

48.1 Affinity analysis (also called Association Rules or Market Basket Analysis) is the study of attributes or characteristics that “go together.” It seeks to uncover rules for quantifying the relationship between two or more attributes. Association rules take the form "If antecedent, then consequent", along with a measure of the support and confidence associated with the rule.

  • It is unsupervised learning

  • Ex: People who purchased Milk, also purchased Bread.

  • Question: Is it similar to Conjoint Analysis

    • No
    • (Aside)
      • Conjoint analysis is a survey-based statistical technique used in market research that helps determine how people value different attributes (feature, function, benefits) that make up an individual product or service.
      • In Conjoint analysis, individual customer /user is distinguished whereas in Affinity analysis or the Market Basket analysis, individuals are not identified.
  • Problem: Dimensionality

    • The number of possible association rules grows exponentially in the number of attributes.
    • We can focus on relevant products i.e. high margin or low expiration range etc.
    • Further we can use the a priori algorithm
      • The a priori algorithm for mining association rules takes advantage of structure within the rules themselves to reduce the search problem to a more manageable size.
  • Refer Association Rules

48.2 The support (s) for a particular association rule \(A \Rightarrow B\) is the proportion of transactions in the set of transactions D that contain both antecedent A and consequent B. Support is Symmetric. \(\text{Support} = P(A \cap B) = \frac{\text{Number of transactions containing both A and B}}{\text{Total Number of Transactions}}\)

48.3 The confidence (c) of the association rule \(A \Rightarrow B\) is a measure of the accuracy of the rule, as determined by the percentage of transactions in the set of transactions D containing antecedent A that also contain consequent B. Confidence is Asymmetric \(\text{Confidence} = P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{\text{Number of transactions containing both A and B}}{\text{Total Number of Transactions containing A}}\)

48.8 Lift is a measure that can quantify the usefulness of an association rule. Lift is Symmetric. \(\text{Lift} = \frac{\text{Rule Confidence}}{\text{Prior proportion of Consequent}}\)

  • We generally try to find rules which have high Support, high Confidence and high Lift.
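The three definitions above can be verified by hand on a toy set of ten transactions (purely illustrative data):

```r
# #Ten toy transactions as purchase indicators for items A and B
A <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)
B <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
n <- length(A)
support    <- sum(A & B) / n             # #P(A and B)              = 5/10
confidence <- sum(A & B) / sum(A)        # #P(B | A)                = 5/6
lift       <- confidence / (sum(B) / n)  # #confidence / prior of B = (5/6)/0.7
round(c(support = support, confidence = confidence, lift = lift), 3)
```

A lift above 1 means buying A raises the chance of B relative to B's baseline rate; a lift of exactly 1 means the two items are independent.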

14.7 Data Makeup

Please import the "B22-Makeup.csv"

  • About: [1000, 14]
    • Each Column represents purchase decisions for each of the 1000 transactions
    • We need to have them as ‘factor’

EDA

bb <- aa <- xxB22Makeup
#
xw <- aa %>% mutate(across(everything(), ~ factor(.x, levels = c("No", "Yes"))))
#str(xw)
dim(xw)
## [1] 1000   14
summary(xw)
##   Bag      Blush     Nail Polish Brushes   Concealer Eyebrow Pencils Bronzer   Lip liner Mascara  
##  No :946   No :637   No :720     No :851   No :558   No :958         No :721   No :766   No :643  
##  Yes: 54   Yes:363   Yes:280     Yes:149   Yes:442   Yes: 42         Yes:279   Yes:234   Yes:357  
##  Eye shadow Foundation Lip Gloss Lipstick  Eyeliner 
##  No :619    No :464    No :510   No :678   No :543  
##  Yes:381    Yes:536    Yes:490   Yes:322   Yes:457

14.8 apriori()

  • arules::apriori() :
    • The default behavior is to mine rules with minimum support of 0.1, minimum confidence of 0.8, maximum of 10 items (maxlen), and a maximal time for subset checking of 5 seconds (maxtime).
    • ‘parameter’ : These are Support and Confidence
      • help(`ASparameter-class`)
    • ‘appearance’ : These are Antecedents and Consequents
      • It can restrict item appearance
    • Caution: Never use inspect() without filtering out rows otherwise R may hang.
      • attributes(summary(rules))$length
  • Warning:
    • “Warning in apriori(xw) : Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!”
    • Increase the ‘maxlen’ parameter value
# #Caution is advised on running inspect() without prior subsetting /filtering the rules
# #Find association rules
#rules <- apriori(xw, maxlen = ncol(xw))
rules <- apriori(xw, parameter = list(maxlen = ncol(xw)))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1     14  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 100 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[28 item(s), 1000 transaction(s)] done [0.00s].
## sorting and recoding items ... [26 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 11 done [0.04s].
## writing ... [68960 rule(s)] done [0.02s].
## creating S4 object  ... done [0.03s].
#
# #More Information
names(attributes(rules))
## [1] "quality" "info"    "lhs"     "rhs"     "class"
#
str(attributes(rules)$quality)
## 'data.frame':    68960 obs. of  5 variables:
##  $ support   : num  0.851 0.946 0.958 0.149 0.129 0.138 0.214 0.224 0.257 0.265 ...
##  $ confidence: num  0.851 0.946 0.958 1 0.866 ...
##  $ coverage  : num  1 1 1 0.149 0.149 0.149 0.234 0.234 0.279 0.279 ...
##  $ lift      : num  1 1 1 3.571 0.915 ...
##  $ count     : int  851 946 958 149 129 138 214 224 257 265 ...
str(attributes(rules)$info)
## List of 5
##  $ data         : symbol xw
##  $ ntransactions: int 1000
##  $ support      : num 0.1
##  $ confidence   : num 0.8
##  $ call         : chr "apriori(data = xw, parameter = list(maxlen = ncol(xw)))"
attributes(rules)$lhs
## itemMatrix in sparse format with
##  68960 rows (elements/transactions) and
##  28 columns (items)
attributes(rules)$rhs
## itemMatrix in sparse format with
##  68960 rows (elements/transactions) and
##  28 columns (items)
#
summary(rules)
## set of 68960 rules
## 
## rule length distribution (lhs + rhs):sizes
##     1     2     3     4     5     6     7     8     9    10    11 
##     3    85   942  4350 10739 17062 18066 11996  4665   972    80 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.000   7.000   6.547   8.000  11.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift            count      
##  Min.   :0.1000   Min.   :0.8000   Min.   :0.1000   Min.   :0.8781   Min.   :100.0  
##  1st Qu.:0.1150   1st Qu.:0.8669   1st Qu.:0.1250   1st Qu.:1.0389   1st Qu.:115.0  
##  Median :0.1370   Median :0.9453   Median :0.1490   Median :1.1568   Median :137.0  
##  Mean   :0.1582   Mean   :0.9260   Mean   :0.1717   Mean   :1.2021   Mean   :158.2  
##  3rd Qu.:0.1770   3rd Qu.:0.9821   3rd Qu.:0.1930   3rd Qu.:1.2440   3rd Qu.:177.0  
##  Max.   :0.9580   Max.   :1.0000   Max.   :1.0000   Max.   :3.5714   Max.   :958.0  
## 
## mining info:
##  data ntransactions support confidence                                                    call
##    xw          1000     0.1        0.8 apriori(data = xw, parameter = list(maxlen = ncol(xw)))
#
names(attributes(summary(rules)))
## [1] "lengths"       "lengthSummary" "length"        "quality"       "info"          "class"
attributes(summary(rules))$length #Check Number of Rules Here.
## [1] 68960
attributes(summary(rules))$lengths
## sizes
##     1     2     3     4     5     6     7     8     9    10    11 
##     3    85   942  4350 10739 17062 18066 11996  4665   972    80
#
# #inspect() Do not execute without knowing how many rows will be printed.
#inspect(rules[1:6]) 
#inspect(head(rules, 6))
inspect(head(rules, min(5, attributes(summary(rules))$length)))
##     lhs              rhs                  support confidence coverage lift      count
## [1] {}            => {Brushes=No}         0.851   0.8510000  1.000    1.0000000 851  
## [2] {}            => {Bag=No}             0.946   0.9460000  1.000    1.0000000 946  
## [3] {}            => {Eyebrow Pencils=No} 0.958   0.9580000  1.000    1.0000000 958  
## [4] {Brushes=Yes} => {Nail Polish=Yes}    0.149   1.0000000  0.149    3.5714286 149  
## [5] {Brushes=Yes} => {Bag=No}             0.129   0.8657718  0.149    0.9151922 129

14.9 Analysis

Create

# #Rules with more control and oversight.
rr_sup <- 0.7
rr_conf <- 0.8
rules <- apriori(xw, parameter = list(
  minlen = 2, maxlen = ncol(xw), support = rr_sup, confidence = rr_conf))
Table 14.1: (B22T01) Support = 0.7 & Confidence = 0.8 gives Rules = 15
SN LHS_Antecedent x RHS_Consequent Support Confidence Coverage Lift Count
1 {Nail Polish=No} => {Brushes=No} 0.72 1 0.72 1.18 720
2 {Brushes=No} => {Nail Polish=No} 0.72 0.846 0.851 1.18 720
3 {Lip liner=No} => {Bag=No} 0.732 0.956 0.766 1.01 732
4 {Lip liner=No} => {Eyebrow Pencils=No} 0.734 0.958 0.766 1 734
5 {Brushes=No} => {Bag=No} 0.817 0.96 0.851 1.01 817
6 {Bag=No} => {Brushes=No} 0.817 0.864 0.946 1.01 817
7 {Brushes=No} => {Eyebrow Pencils=No} 0.82 0.964 0.851 1.01 820
8 {Eyebrow Pencils=No} => {Brushes=No} 0.82 0.856 0.958 1.01 820
9 {Bag=No} => {Eyebrow Pencils=No} 0.909 0.961 0.946 1 909
10 {Eyebrow Pencils=No} => {Bag=No} 0.909 0.949 0.958 1 909
11 {Bag=No, Lip liner=No} => {Eyebrow Pencils=No} 0.703 0.96 0.732 1 703
12 {Eyebrow Pencils=No, Lip liner=No} => {Bag=No} 0.703 0.958 0.734 1.01 703
13 {Bag=No, Brushes=No} => {Eyebrow Pencils=No} 0.789 0.966 0.817 1.01 789
14 {Brushes=No, Eyebrow Pencils=No} => {Bag=No} 0.789 0.962 0.82 1.02 789
15 {Bag=No, Eyebrow Pencils=No} => {Brushes=No} 0.789 0.868 0.909 1.02 789

Inspect

# #Do not print more than 50 Rules at a time.
inspect(head(rules, min(50, attributes(summary(rules))$length)))
##      lhs                                   rhs                  support confidence coverage
## [1]  {Nail Polish=No}                   => {Brushes=No}         0.720   1.0000000  0.720   
## [2]  {Brushes=No}                       => {Nail Polish=No}     0.720   0.8460635  0.851   
## [3]  {Lip liner=No}                     => {Bag=No}             0.732   0.9556136  0.766   
## [4]  {Lip liner=No}                     => {Eyebrow Pencils=No} 0.734   0.9582245  0.766   
## [5]  {Brushes=No}                       => {Bag=No}             0.817   0.9600470  0.851   
## [6]  {Bag=No}                           => {Brushes=No}         0.817   0.8636364  0.946   
## [7]  {Brushes=No}                       => {Eyebrow Pencils=No} 0.820   0.9635723  0.851   
## [8]  {Eyebrow Pencils=No}               => {Brushes=No}         0.820   0.8559499  0.958   
## [9]  {Bag=No}                           => {Eyebrow Pencils=No} 0.909   0.9608879  0.946   
## [10] {Eyebrow Pencils=No}               => {Bag=No}             0.909   0.9488518  0.958   
## [11] {Bag=No, Lip liner=No}             => {Eyebrow Pencils=No} 0.703   0.9603825  0.732   
## [12] {Eyebrow Pencils=No, Lip liner=No} => {Bag=No}             0.703   0.9577657  0.734   
## [13] {Bag=No, Brushes=No}               => {Eyebrow Pencils=No} 0.789   0.9657283  0.817   
## [14] {Brushes=No, Eyebrow Pencils=No}   => {Bag=No}             0.789   0.9621951  0.820   
## [15] {Bag=No, Eyebrow Pencils=No}       => {Brushes=No}         0.789   0.8679868  0.909   
##      lift     count
## [1]  1.175088 720  
## [2]  1.175088 720  
## [3]  1.010162 732  
## [4]  1.000234 734  
## [5]  1.014849 817  
## [6]  1.014849 817  
## [7]  1.005817 820  
## [8]  1.005817 820  
## [9]  1.003015 909  
## [10] 1.003015 909  
## [11] 1.002487 703  
## [12] 1.012437 703  
## [13] 1.008067 789  
## [14] 1.017120 789  
## [15] 1.019961 789

Tibble

# # Limit Max Rows | To Tibble | Rename | Add Row Numbers | Relocate | Format decimals |
inspect(head(rules, min(50, attributes(summary(rules))$length))) %>% 
  as_tibble(.name_repair = 'unique') %>% 
  rename(x = '...2', LHS_Antecedent = lhs, RHS_Consequent = rhs) %>% 
  rename_with(str_to_title, .cols = where(is.numeric)) %>% 
  mutate(SN = row_number()) %>% relocate(SN) %>% 
  mutate(across(where(is.numeric), format, digits = 3, drop0trailing = TRUE, scientific = FALSE)) 
##      lhs                                   rhs                  support confidence coverage
## [1]  {Nail Polish=No}                   => {Brushes=No}         0.720   1.0000000  0.720   
## [2]  {Brushes=No}                       => {Nail Polish=No}     0.720   0.8460635  0.851   
## [3]  {Lip liner=No}                     => {Bag=No}             0.732   0.9556136  0.766   
## [4]  {Lip liner=No}                     => {Eyebrow Pencils=No} 0.734   0.9582245  0.766   
## [5]  {Brushes=No}                       => {Bag=No}             0.817   0.9600470  0.851   
## [6]  {Bag=No}                           => {Brushes=No}         0.817   0.8636364  0.946   
## [7]  {Brushes=No}                       => {Eyebrow Pencils=No} 0.820   0.9635723  0.851   
## [8]  {Eyebrow Pencils=No}               => {Brushes=No}         0.820   0.8559499  0.958   
## [9]  {Bag=No}                           => {Eyebrow Pencils=No} 0.909   0.9608879  0.946   
## [10] {Eyebrow Pencils=No}               => {Bag=No}             0.909   0.9488518  0.958   
## [11] {Bag=No, Lip liner=No}             => {Eyebrow Pencils=No} 0.703   0.9603825  0.732   
## [12] {Eyebrow Pencils=No, Lip liner=No} => {Bag=No}             0.703   0.9577657  0.734   
## [13] {Bag=No, Brushes=No}               => {Eyebrow Pencils=No} 0.789   0.9657283  0.817   
## [14] {Brushes=No, Eyebrow Pencils=No}   => {Bag=No}             0.789   0.9621951  0.820   
## [15] {Bag=No, Eyebrow Pencils=No}       => {Brushes=No}         0.789   0.8679868  0.909   
##      lift     count
## [1]  1.175088 720  
## [2]  1.175088 720  
## [3]  1.010162 732  
## [4]  1.000234 734  
## [5]  1.014849 817  
## [6]  1.014849 817  
## [7]  1.005817 820  
## [8]  1.005817 820  
## [9]  1.003015 909  
## [10] 1.003015 909  
## [11] 1.002487 703  
## [12] 1.012437 703  
## [13] 1.008067 789  
## [14] 1.017120 789  
## [15] 1.019961 789
## # A tibble: 15 x 9
##    SN    LHS_Antecedent                 x     RHS_Consequent Support Confidence Coverage Lift  Count
##    <chr> <chr>                          <chr> <chr>          <chr>   <chr>      <chr>    <chr> <chr>
##  1 " 1"  {Nail Polish=No}               =>    {Brushes=No}   0.72    1          0.72     1.18  720  
##  2 " 2"  {Brushes=No}                   =>    {Nail Polish=~ 0.72    0.846      0.851    1.18  720  
##  3 " 3"  {Lip liner=No}                 =>    {Bag=No}       0.732   0.956      0.766    1.01  732  
##  4 " 4"  {Lip liner=No}                 =>    {Eyebrow Penc~ 0.734   0.958      0.766    1     734  
##  5 " 5"  {Brushes=No}                   =>    {Bag=No}       0.817   0.96       0.851    1.01  817  
##  6 " 6"  {Bag=No}                       =>    {Brushes=No}   0.817   0.864      0.946    1.01  817  
##  7 " 7"  {Brushes=No}                   =>    {Eyebrow Penc~ 0.82    0.964      0.851    1.01  820  
##  8 " 8"  {Eyebrow Pencils=No}           =>    {Brushes=No}   0.82    0.856      0.958    1.01  820  
##  9 " 9"  {Bag=No}                       =>    {Eyebrow Penc~ 0.909   0.961      0.946    1     909  
## 10 "10"  {Eyebrow Pencils=No}           =>    {Bag=No}       0.909   0.949      0.958    1     909  
## 11 "11"  {Bag=No, Lip liner=No}         =>    {Eyebrow Penc~ 0.703   0.96       0.732    1     703  
## 12 "12"  {Eyebrow Pencils=No, Lip line~ =>    {Bag=No}       0.703   0.958      0.734    1.01  703  
## 13 "13"  {Bag=No, Brushes=No}           =>    {Eyebrow Penc~ 0.789   0.966      0.817    1.01  789  
## 14 "14"  {Brushes=No, Eyebrow Pencils=~ =>    {Bag=No}       0.789   0.962      0.82     1.02  789  
## 15 "15"  {Bag=No, Eyebrow Pencils=No}   =>    {Brushes=No}   0.789   0.868      0.909    1.02  789

Summarise Count Binary Columns

# #If Data has TRUE /FALSE in place of Yes /No, Then more options are available: sum() which()
# #The last option 'mm' does not use table() and remains a tibble, so fewer steps are required
summary(xw)
##   Bag      Blush     Nail Polish Brushes   Concealer Eyebrow Pencils Bronzer   Lip liner Mascara  
##  No :946   No :637   No :720     No :851   No :558   No :958         No :721   No :766   No :643  
##  Yes: 54   Yes:363   Yes:280     Yes:149   Yes:442   Yes: 42         Yes:279   Yes:234   Yes:357  
##  Eye shadow Foundation Lip Gloss Lipstick  Eyeliner 
##  No :619    No :464    No :510   No :678   No :543  
##  Yes:381    Yes:536    Yes:490   Yes:322   Yes:457
# #Count Binary Columns | Transpose | Tibble | Integer | Sort |
ii <- t(vapply(xw, table, numeric(2))) %>% 
  as_tibble(rownames = 'Items') %>% 
  mutate(across(where(is.numeric), as.integer)) %>% 
  arrange(desc(Yes))
# #Match One of the Values | Transpose | Tibble | Rename | Wide | Rename | Sort |
jj <- t(table(xw == 'Yes', names(xw)[col(xw)])) %>% 
  as_tibble(.name_repair = 'unique') %>% 
  rename(Items = 1, Key = 2) %>% 
  pivot_wider(names_from = Key, values_from = n) %>% 
  rename(No = 2, Yes = 3) %>% 
  arrange(desc(Yes))
# #Unlist | Remove Appended Numbers | Count | Transpose | Tibble | Rename | Wide | Rename | Sort |
kk <- t(table(unlist(xw), sub('\\d+', '', names(unlist(xw))))) %>% 
  as_tibble(.name_repair = 'unique') %>% 
  rename(Items = 1, Key = 2) %>% 
  pivot_wider(names_from = Key, values_from = n) %>% 
  rename(No = 2, Yes = 3) %>% 
  arrange(desc(Yes))
# #Long | Table | Tibble | Wide | Rename | Sort | 
ll <- xw %>% 
  pivot_longer(cols = everything(), names_to = 'Items', values_to = 'Key') %>% 
  table() %>% 
  as_tibble() %>% 
  pivot_wider(names_from = Key, values_from = n) %>% 
  rename(No = 2, Yes = 3) %>% 
  arrange(desc(Yes))
# #Long | Count | Wide | Sort | 
mm <- xw %>% 
  pivot_longer(cols = everything(), names_to = 'Items', values_to = 'Key') %>% 
  count(Items, Key) %>% 
  pivot_wider(names_from = Key, values_from = n) %>% 
  arrange(desc(Yes))
stopifnot(all(vapply(list(jj, kk, ll, mm), FUN = function(x) identical(x, ii), logical(1))))
#
# #Option 'mm' is preferable
xw %>% 
  pivot_longer(cols = everything()) %>% 
  count(name, value) %>% 
  pivot_wider(names_from = value, values_from = n) %>% 
  arrange(desc(Yes))
## # A tibble: 14 x 3
##    name               No   Yes
##    <chr>           <int> <int>
##  1 Foundation        464   536
##  2 Lip Gloss         510   490
##  3 Eyeliner          543   457
##  4 Concealer         558   442
##  5 Eye shadow        619   381
##  6 Blush             637   363
##  7 Mascara           643   357
##  8 Lipstick          678   322
##  9 Nail Polish       720   280
## 10 Bronzer           721   279
## 11 Lip liner         766   234
## 12 Brushes           851   149
## 13 Bag               946    54
## 14 Eyebrow Pencils   958    42

Multiple identical

mm <- ll <- kk <- jj <- ii <- 1:5
# #Pairwise Identical Check
all(identical(ii, jj), identical(ii, kk), identical(ii, ll), identical(ii, mm))
## [1] TRUE
#
# #Pairwise Identical Check using vapply()
# #It can provide info on which pair does not match OR can be passed to all()
vapply(list(jj, kk, ll, mm), FUN = function(x) identical(x, ii), logical(1))
## [1] TRUE TRUE TRUE TRUE
#
stopifnot(all(vapply(list(jj, kk, ll, mm), FUN = function(x) identical(x, ii), logical(1))))

14.10 inspect()

  • ‘Foundation’ has the maximum Yes count; it is the most frequently purchased item.
    • So, taking it as the Consequent, we want to look at what its antecedents are
  • Question: Why have we reduced the support cutoff
    • If the support cutoff is high we will not get many rules, because we have already restricted the RHS to “Foundation” only
  • Question: Why have we reduced maxlen to 3? Can we not keep the original higher value
    • We can do that, but rules with too many products are not going to help us much.
    • (Aside) In general, beyond 3 items the combinations add complexity without enough benefit.
  • Question: Should we not consider “Foundation” as the LHS
    • OR: Why is Foundation taken as the Consequent (RHS) and not as the Antecedent (LHS)
    • We can do that. For now we have chosen “Foundation” as the Consequent (RHS)
  • inspect()
    • Ex: SN = 1: LHS_Antecedent_A {Lip Gloss=Yes} RHS_Consequent_B {Foundation=Yes}
      • 490 ‘Lip Gloss’ purchased in 1000 Total
      • 536 ‘Foundation’ purchased in 1000 Total
      • 356 ‘Foundation’ purchased within 490 ‘Lip Gloss’ purchases
      • \(\text{Prior Proportion} = \text{Support} = P(A \cap B) = \frac{\text{Number of transactions containing both A and B}}{\text{Total Number of Transactions}} = \frac{356}{1000} = 0.356\)
      • \(\text{Confidence} = P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{\text{Number of transactions containing both A and B}}{\text{Total Number of Transactions containing A}} = \frac{356}{490} = 0.727\)
        • An item set with higher confidence i.e. value near 1 means that this item set has higher likelihood of purchase
        • (Aside) Confidence, however, sometimes overestimates and is asymmetric, so using Lift is the better option.
      • \(\text{Coverage} = \text{LHS Support} = P(A) = \frac{\text{Number of transactions containing A}}{\text{Total Number of Transactions}} = \frac{490}{1000} = 0.490\)
      • Count = Number of transactions containing both A and B = 356
    • Lift
      • Book \(\text{Lift} = \frac{\text{Confidence}}{\text{Prior Proportion of Consequent}} = \frac{\text{Confidence}}{\text{RHS Support}} = \frac{0.727}{0.536} = 1.36\)
      • Package \(\text{Lift} = \frac{\text{Support}}{\text{Coverage} \times \text{RHS Support}} = \frac{0.356}{0.490 \times 0.536} = 1.355\), i.e. the same quantity as above up to rounding
  • Question: There are some redundant rules. In fact, out of the total 16 rules, all except SN {2, 3, 4} basically duplicate Rule 1 with one more item being “No.”
    • We need to purify the rules
    • “ForLater”
  • Question: When restricting the “LHS” to Bag and Blush Yes only, their No rules are still present.
    • Reduce the support and confidence cutoff and use default=“none”
  • A rule is redundant if a more general rule with the same or a higher confidence exists.
    • That is, a more specific rule is redundant if it is only equally or even less predictive than a more general rule.
    • A rule is more general if it has the same RHS but one or more items removed from the LHS.
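The Rule 1 metrics discussed above can be re-derived from the raw counts in base R as a quick sanity check (the counts 1000 / 490 / 536 / 356 come from the frequency table at the top of this section):

```r
# #Verify Rule 1: {Lip Gloss=Yes} => {Foundation=Yes} from raw counts
n_total <- 1000   # total transactions
n_A     <- 490    # Lip Gloss purchases (LHS)
n_B     <- 536    # Foundation purchases (RHS)
n_AB    <- 356    # transactions with both

support    <- n_AB / n_total               # P(A and B)
confidence <- n_AB / n_A                   # P(B | A)
coverage   <- n_A / n_total                # P(A) = LHS support
lift       <- confidence / (n_B / n_total) # = support / (coverage * RHS support)

round(c(support = support, confidence = confidence,
        coverage = coverage, lift = lift), 3)
```

These match the Support, Confidence, Coverage, and Lift reported for SN = 1 in Table 14.2.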

Specify RHS

# #Rules with more control and oversight. RHS: "Foundation=Yes"
rr_sup <- 0.1
rr_conf <- 0.7
rules <- suppressWarnings(apriori(xw, 
  parameter = list(minlen = 2, maxlen = 3, support = rr_sup, confidence = rr_conf),
  appearance = list(rhs = paste0(names(xw)[11], "=", levels(xw[[11]])[2]), 
                    default = "lhs")))
Table 14.2: (B22T02) Support = 0.1 & Confidence = 0.7 gives Rules = 16
SN LHS_Antecedent x RHS_Consequent Support Confidence Coverage Lift Count
1 {Lip Gloss=Yes} => {Foundation=Yes} 0.356 0.727 0.49 1.36 356
2 {Lip Gloss=Yes, Lipstick=Yes} => {Foundation=Yes} 0.116 0.734 0.158 1.37 116
3 {Mascara=Yes, Lip Gloss=Yes} => {Foundation=Yes} 0.13 0.718 0.181 1.34 130
4 {Eye shadow=Yes, Lip Gloss=Yes} => {Foundation=Yes} 0.146 0.726 0.201 1.36 146
5 {Lip Gloss=Yes, Eyeliner=No} => {Foundation=Yes} 0.2 0.76 0.263 1.42 200
6 {Concealer=No, Lip Gloss=Yes} => {Foundation=Yes} 0.215 0.79 0.272 1.47 215
7 {Eye shadow=No, Lip Gloss=Yes} => {Foundation=Yes} 0.21 0.727 0.289 1.36 210
8 {Blush=No, Lip Gloss=Yes} => {Foundation=Yes} 0.237 0.76 0.312 1.42 237
9 {Mascara=No, Lip Gloss=Yes} => {Foundation=Yes} 0.226 0.731 0.309 1.36 226
10 {Lip Gloss=Yes, Lipstick=No} => {Foundation=Yes} 0.24 0.723 0.332 1.35 240
11 {Nail Polish=No, Lip Gloss=Yes} => {Foundation=Yes} 0.267 0.75 0.356 1.4 267
12 {Bronzer=No, Lip Gloss=Yes} => {Foundation=Yes} 0.295 0.845 0.349 1.58 295
13 {Lip liner=No, Lip Gloss=Yes} => {Foundation=Yes} 0.31 0.829 0.374 1.55 310
14 {Brushes=No, Lip Gloss=Yes} => {Foundation=Yes} 0.313 0.742 0.422 1.38 313
15 {Bag=No, Lip Gloss=Yes} => {Foundation=Yes} 0.335 0.728 0.46 1.36 335
16 {Eyebrow Pencils=No, Lip Gloss=Yes} => {Foundation=Yes} 0.345 0.728 0.474 1.36 345

Verify a Rule

# #Specific Rule: SN = 1: LHS_Antecedent {Lip Gloss=Yes} RHS_Consequent {Foundation=Yes}
ii <- xw %>% select(11, 12) %>% rename(Lip_Gloss = 2) %>% count(Foundation, Lip_Gloss)
# #490 'Lip Gloss' purchased in 1000 Total
ii %>% group_by(Lip_Gloss) %>% summarise(SUM = sum(n)) %>% mutate(PROP = SUM/sum(SUM))
## # A tibble: 2 x 3
##   Lip_Gloss   SUM  PROP
##   <fct>     <int> <dbl>
## 1 No          510  0.51
## 2 Yes         490  0.49
# #536 'Foundation' purchased in 1000 Total
ii %>% group_by(Foundation) %>% summarise(SUM = sum(n)) %>% mutate(PROP = SUM/sum(SUM))
## # A tibble: 2 x 3
##   Foundation   SUM  PROP
##   <fct>      <int> <dbl>
## 1 No           464 0.464
## 2 Yes          536 0.536
# #356 'Foundation' purchased within 490 'Lip Gloss' purchases
ii %>% filter(Lip_Gloss == 'Yes') %>% mutate(PROP = n/sum(n))
## # A tibble: 2 x 4
##   Foundation Lip_Gloss     n  PROP
##   <fct>      <fct>     <int> <dbl>
## 1 No         Yes         134 0.273
## 2 Yes        Yes         356 0.727

Focus Specific LHS

# #RHS: "Foundation=Yes"
# #LHS: Only Bag Yes, Blush Yes | To identify these Rules supply lower support and confidence
# #(Default) = "both", "lhs", "rhs", "none". Specified the default appearance for all items ...
# #...not explicitly mentioned in the other elements of the list.
# #If default = "lhs" is supplied then redundant rules come up.
rr_sup <- 0.01
rr_conf <- 0.1
rules <- suppressWarnings(apriori(xw, 
  parameter = list(minlen = 2, maxlen = 3, support = rr_sup, confidence = rr_conf),
  appearance = list(rhs = paste0(names(xw)[11], "=", levels(xw[[11]])[2]), 
                    lhs = c("Bag=Yes", "Blush=Yes"), 
                    default = "none")))
Table 14.3: (B22T03) Support = 0.01 & Confidence = 0.1 gives Rules = 3
SN LHS_Antecedent x RHS_Consequent Support Confidence Coverage Lift Count
1 {Bag=Yes} => {Foundation=Yes} 0.031 0.574 0.054 1.071 31
2 {Blush=Yes} => {Foundation=Yes} 0.192 0.529 0.363 0.987 192
3 {Bag=Yes, Blush=Yes} => {Foundation=Yes} 0.019 0.594 0.032 1.108 19

Yes LHS Only

# #RHS: "Foundation=Yes"
# #LHS: All Yes Only 
rr_sup <- 0.01
rr_conf <- 0.1
rr_rhs <- 11L #index of "Foundation"
rules <- suppressWarnings(apriori(xw, 
  parameter = list(minlen = 2, maxlen = 3, support = rr_sup, confidence = rr_conf),
  appearance = list(rhs = paste0(names(xw)[rr_rhs], "=", levels(xw[[rr_rhs]])[2]), 
                    lhs = paste0(names(xw)[-rr_rhs], "=", levels(xw[[rr_rhs]])[2]), 
                    default = "none")))
Table 14.4: (B22T04) Support = 0.01 & Confidence = 0.1 gives Rules = 84
SN LHS_Antecedent x RHS_Consequent Support Confidence Coverage Lift Count
1 {Eyebrow Pencils=Yes} => {Foundation=Yes} 0.019 0.452 0.042 0.844 19
2 {Bag=Yes} => {Foundation=Yes} 0.031 0.574 0.054 1.071 31
3 {Brushes=Yes} => {Foundation=Yes} 0.074 0.497 0.149 0.927 74
4 {Lip liner=Yes} => {Foundation=Yes} 0.087 0.372 0.234 0.694 87
5 {Lipstick=Yes} => {Foundation=Yes} 0.167 0.519 0.322 0.968 167
6 {Nail Polish=Yes} => {Foundation=Yes} 0.143 0.511 0.28 0.953 143
7 {Bronzer=Yes} => {Foundation=Yes} 0.133 0.477 0.279 0.889 133
8 {Blush=Yes} => {Foundation=Yes} 0.192 0.529 0.363 0.987 192
9 {Mascara=Yes} => {Foundation=Yes} 0.192 0.538 0.357 1.003 192
10 {Eye shadow=Yes} => {Foundation=Yes} 0.211 0.554 0.381 1.033 211
11 {Eyeliner=Yes} => {Foundation=Yes} 0.238 0.521 0.457 0.972 238
12 {Lip Gloss=Yes} => {Foundation=Yes} 0.356 0.727 0.49 1.355 356
13 {Concealer=Yes} => {Foundation=Yes} 0.231 0.523 0.442 0.975 231
14 {Eyebrow Pencils=Yes, Lipstick=Yes} => {Foundation=Yes} 0.013 0.481 0.027 0.898 13
15 {Blush=Yes, Eyebrow Pencils=Yes} => {Foundation=Yes} 0.014 0.5 0.028 0.933 14
16 {Eyebrow Pencils=Yes, Mascara=Yes} => {Foundation=Yes} 0.01 0.435 0.023 0.811 10
17 {Eyebrow Pencils=Yes, Eye shadow=Yes} => {Foundation=Yes} 0.012 0.48 0.025 0.896 12
18 {Eyebrow Pencils=Yes, Lip Gloss=Yes} => {Foundation=Yes} 0.011 0.688 0.016 1.283 11
19 {Concealer=Yes, Eyebrow Pencils=Yes} => {Foundation=Yes} 0.011 0.5 0.022 0.933 11
20 {Bag=Yes, Brushes=Yes} => {Foundation=Yes} 0.01 0.5 0.02 0.933 10
21 {Bag=Yes, Lipstick=Yes} => {Foundation=Yes} 0.011 0.688 0.016 1.283 11
22 {Bag=Yes, Nail Polish=Yes} => {Foundation=Yes} 0.015 0.536 0.028 0.999 15
23 {Bag=Yes, Bronzer=Yes} => {Foundation=Yes} 0.011 0.5 0.022 0.933 11
24 {Bag=Yes, Blush=Yes} => {Foundation=Yes} 0.019 0.594 0.032 1.108 19
25 {Bag=Yes, Mascara=Yes} => {Foundation=Yes} 0.022 0.579 0.038 1.08 22
26 {Bag=Yes, Eye shadow=Yes} => {Foundation=Yes} 0.016 0.5 0.032 0.933 16
27 {Bag=Yes, Eyeliner=Yes} => {Foundation=Yes} 0.015 0.517 0.029 0.965 15
28 {Bag=Yes, Lip Gloss=Yes} => {Foundation=Yes} 0.021 0.7 0.03 1.306 21
29 {Bag=Yes, Concealer=Yes} => {Foundation=Yes} 0.021 0.6 0.035 1.119 21
30 {Brushes=Yes, Lip liner=Yes} => {Foundation=Yes} 0.023 0.338 0.068 0.631 23
31 {Brushes=Yes, Lipstick=Yes} => {Foundation=Yes} 0.028 0.571 0.049 1.066 28
32 {Nail Polish=Yes, Brushes=Yes} => {Foundation=Yes} 0.074 0.497 0.149 0.927 74
33 {Brushes=Yes, Bronzer=Yes} => {Foundation=Yes} 0.046 0.474 0.097 0.885 46
34 {Blush=Yes, Brushes=Yes} => {Foundation=Yes} 0.039 0.557 0.07 1.039 39
35 {Brushes=Yes, Mascara=Yes} => {Foundation=Yes} 0.039 0.47 0.083 0.877 39
36 {Brushes=Yes, Eye shadow=Yes} => {Foundation=Yes} 0.037 0.457 0.081 0.852 37
37 {Brushes=Yes, Eyeliner=Yes} => {Foundation=Yes} 0.034 0.436 0.078 0.813 34
38 {Brushes=Yes, Lip Gloss=Yes} => {Foundation=Yes} 0.043 0.632 0.068 1.18 43
39 {Brushes=Yes, Concealer=Yes} => {Foundation=Yes} 0.044 0.478 0.092 0.892 44
40 {Lip liner=Yes, Lipstick=Yes} => {Foundation=Yes} 0.025 0.333 0.075 0.622 25
41 {Nail Polish=Yes, Lip liner=Yes} => {Foundation=Yes} 0.036 0.387 0.093 0.722 36
42 {Bronzer=Yes, Lip liner=Yes} => {Foundation=Yes} 0.046 0.359 0.128 0.67 46
43 {Blush=Yes, Lip liner=Yes} => {Foundation=Yes} 0.051 0.411 0.124 0.767 51
44 {Lip liner=Yes, Mascara=Yes} => {Foundation=Yes} 0.042 0.393 0.107 0.732 42
45 {Lip liner=Yes, Eye shadow=Yes} => {Foundation=Yes} 0.042 0.375 0.112 0.7 42
46 {Lip liner=Yes, Eyeliner=Yes} => {Foundation=Yes} 0.046 0.354 0.13 0.66 46
47 {Lip liner=Yes, Lip Gloss=Yes} => {Foundation=Yes} 0.046 0.397 0.116 0.74 46
48 {Concealer=Yes, Lip liner=Yes} => {Foundation=Yes} 0.07 0.391 0.179 0.73 70
49 {Nail Polish=Yes, Lipstick=Yes} => {Foundation=Yes} 0.051 0.573 0.089 1.069 51
50 {Bronzer=Yes, Lipstick=Yes} => {Foundation=Yes} 0.044 0.489 0.09 0.912 44

Paste String to each element

# #Paste String "=Yes" to each element of a Vector except "Foundation"
names(xw)
##  [1] "Bag"             "Blush"           "Nail Polish"     "Brushes"         "Concealer"      
##  [6] "Eyebrow Pencils" "Bronzer"         "Lip liner"       "Mascara"         "Eye shadow"     
## [11] "Foundation"      "Lip Gloss"       "Lipstick"        "Eyeliner"
paste0(names(xw)[-11], "=", levels(xw[[11]])[2])
##  [1] "Bag=Yes"             "Blush=Yes"           "Nail Polish=Yes"     "Brushes=Yes"        
##  [5] "Concealer=Yes"       "Eyebrow Pencils=Yes" "Bronzer=Yes"         "Lip liner=Yes"      
##  [9] "Mascara=Yes"         "Eye shadow=Yes"      "Lip Gloss=Yes"       "Lipstick=Yes"       
## [13] "Eyeliner=Yes"

Validation


15 Mid Year Evaluation (B23, Dec-12)

15.1 Data Champo Carpets

Data

Please import the "B23-Champo.csv"

xxB23Champo <- c("xxB23Champo_2_Both", "xxB23Champo_3_Order", "xxB23Champo_4_Sample",
              "xxB23Champo_6_Cluster", "xxB23Champo_5_Reco", "xxB23Champo_7_Colours", 
              "xxB23Champo_8_SKU", "xxB23Champo_9_RecoTrans")
# #Dimensions of all of these datasets
str(lapply(xxB23Champo, function(x) {dim(eval(parse(text = x)))}))
## List of 8
##  $ : int [1:2] 18955 16
##  $ : int [1:2] 13135 12
##  $ : int [1:2] 5820 25
##  $ : int [1:2] 20 21
##  $ : int [1:2] 45 14
##  $ : int [1:2] 11 8
##  $ : int [1:2] 11 8
##  $ : int [1:2] 20 21

Import Excel

# #Path to the Excel File #read_delim(clipboard())
loc_src <- paste0(.z$XL, "B23-Champo.xlsx")
#excel_sheets(loc_src)
# #Read Sheets
xxB23Champo_2_Both      <- read_excel(path = loc_src, sheet = 2)
xxB23Champo_3_Order     <- read_excel(path = loc_src, sheet = 3)
xxB23Champo_4_Sample    <- read_excel(path = loc_src, sheet = 4)
xxB23Champo_6_Cluster   <- read_excel(path = loc_src, sheet = 6)
xxB23Champo_5_Reco      <- read_excel(path = loc_src, sheet = 5, range = "A1:U21")
xxB23Champo_7_Colours   <- read_excel(path = loc_src, sheet = 7, range = "A1:H12")
xxB23Champo_8_SKU       <- read_excel(path = loc_src, sheet = 7, range = "J1:Q12")
xxB23Champo_9_RecoTrans <- read_excel(path = loc_src, sheet = 5, range = "X1:AR21")
# #Save the Loaded data as Binary Files
for(ii in xxB23Champo){
  saveRDS(eval(parse(text = ii)), paste0(.z$XL, ii, ".rds"))
}

General Information

Process (3 weeks to 3 months) : Design \(\Rightarrow\) CAD (Visual, Material) \(\Rightarrow\) Procurement \(\Rightarrow\) Warehousing \(\Rightarrow\) Dyeing \(\Rightarrow\) Storage of Dyed Yarn \(\Rightarrow\) Preparation for Weaving or Hand-Tufting \(\Rightarrow\) Wounding \(\Rightarrow\) Finishing (edges etc.) \(\Rightarrow\) Inspection \(\Rightarrow\) Dispatch.

Product categories (4 major) - hand-tufted carpets (least effort, most popular), hand-knotted carpets (skilled, most expensive), Kilims (woolen, expensive) and Durries (Indian variant)

Company sent samples to the client as per …

  • the latest fiber and color trends
  • color and design attributes of their past purchases
  • raw material availability in the inventory (preferred, focused effort)
  • reproduced the swatches as sent by the client into samples

Cost-efficient way of selecting appropriate sample designs that could generate maximum revenue.

Belief: carpet attributes could be used for creating customer segments, which in turn could be used for developing models such as classification to identify customer preferences and recommendation systems

to identify the most important customers and the most important products and find a way to connect the two using suitable attributes from data and appropriate analytics models

15.2 Data Fantasy Sports

Data

Please import the "B23-FantasySports.csv"

xxB23Sports <- c("xxB23Sports_Q3_2T_Paid", "xxB23Sports_Q3_2T_Free", "xxB23Sports_Q4_2T",
  "xxB23Sports_Q5_Chi_Player", "xxB23Sports_Q5_Chi_Captain", "xxB23Sports_Q6_2T_119_Select",
  "xxB23Sports_Q6_2T_119_NotSelect", "xxB23Sports_Q6_2T_6_Select", "xxB23Sports_Q6_2T_6_NotSelect",
  "xxB23Sports_Q7_Anova_NotSelect", "xxB23Sports_Q7_Anova_Captain", "xxB23Sports_Q7_Anova_VC",
  "xxB23Sports_Q8_Regression")
# #Dimensions of all of these datasets.
#sapply(lapply(xxB23Sports, function(x) {dim(eval(parse(text = x)))}), "[[", 1)
str(lapply(xxB23Sports, function(x) {dim(eval(parse(text = x)))}))
## List of 13
##  $ : int [1:2] 5180 3
##  $ : int [1:2] 8288 3
##  $ : int [1:2] 72 15
##  $ : int [1:2] 10 16
##  $ : int [1:2] 10 16
##  $ : int [1:2] 223738 2
##  $ : int [1:2] 22087 2
##  $ : int [1:2] 279868 2
##  $ : int [1:2] 159890 2
##  $ : int [1:2] 178691 3
##  $ : int [1:2] 225474 3
##  $ : int [1:2] 85710 3
##  $ : int [1:2] 55272 5

Import Excel

# #Path to the Excel File
loc_src <- paste0(.z$XL, "B23-FantasySports.xlsx")
# #Read Sheets
xxB23Sports_Q3_2T_Paid     <- read_excel(path = loc_src, sheet = 2, range = "A8:C5188")
xxB23Sports_Q3_2T_Free     <- read_excel(path = loc_src, sheet = 2, range = "E8:G8296")
xxB23Sports_Q4_2T          <- read_excel(path = loc_src, sheet = 3, range = "A8:O80")
xxB23Sports_Q5_Chi_Player  <- read_excel(path = loc_src, sheet = 4, range = "A15:P25")
xxB23Sports_Q5_Chi_Captain <- read_excel(path = loc_src, sheet = 4, range = "A31:P41")
xxB23Sports_Q8_Regression  <- read_excel(path = loc_src, sheet = 7, range = "A7:E55279")
#
# #Create CSV Files because of package failure in reading large excel data
tbl <- read_csv(paste0(.z$XL, "B23-Sports-Q6-2T-119-Select", ".csv"), show_col_types = FALSE)
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxB23Sports_Q6_2T_119_Select <- tbl
tbl <- read_csv(paste0(.z$XL, "B23-Sports-Q6-2T-119-NotSelect", ".csv"), show_col_types = FALSE)
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxB23Sports_Q6_2T_119_NotSelect <- tbl
tbl <- read_csv(paste0(.z$XL, "B23-Sports-Q6-2T-6-Select", ".csv"), show_col_types = FALSE)
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxB23Sports_Q6_2T_6_Select <- tbl
tbl <- read_csv(paste0(.z$XL, "B23-Sports-Q6-2T-6-NotSelect", ".csv"), show_col_types = FALSE)
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxB23Sports_Q6_2T_6_NotSelect <- tbl
#
tbl <- read_csv(paste0(.z$XL, "B23-Sports-Q7-Anova-NotSelect", ".csv"), show_col_types = FALSE)
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxB23Sports_Q7_Anova_NotSelect <- tbl
tbl <- read_csv(paste0(.z$XL, "B23-Sports-Q7-Anova-Captain", ".csv"), show_col_types = FALSE)
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxB23Sports_Q7_Anova_Captain <- tbl
tbl <- read_csv(paste0(.z$XL, "B23-Sports-Q7-Anova-VC", ".csv"), show_col_types = FALSE)
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxB23Sports_Q7_Anova_VC <- tbl
# #Save the Loaded data as Binary Files
for(ii in xxB23Sports){
  saveRDS(eval(parse(text = ii)), paste0(.z$XL, ii, ".rds"))
}

General Information

whether fantasy sports was a game of chance or skill, especially whether skill is a dominant factor in winning fantasy sports competitions.

(Legal) The decision between skill and chance was to be decided based on whether the skill-based element was dominant over chance in determining the outcome of the game.

If fantasy sports is chance-based, then every user should have an equal probability of winning.

understand the key difference between skill and chance and how to test it using the data.

To prove that it is skill dominant, we have to prove that users who are scoring high in fantasy sports are playing a strategic game, their selection of players and captain and vice-captain is more knowledge based than random selection.

If fantasy games involve skill, then we can expect consistency in the performance of the users, both low and high. Alternatively, we can also check whether the selection of a specific player increases the probability of winning fantasy sports.

I think our approach should be to identify and test several possible hypotheses to establish whether fantasy sports is skill dominant or chance dominant.

  • Various rounds/matches were played. For example, one IPL match would be one round.
  • There were players who were available to be picked up for a round or match. (This number is more than the number of players who would actually play in that match, so it was possible that a player selected by a user in his team may not actually play.)
  • For every round, multiple contests were opened. The contests were of different categories, from free to paid, and various types of playing and winning options (public, private, special).
  • User selected a team for a round and for a contest.
  • There was a player round performance table which indicated how the player performed in the specific round.
  • Teams selected by users were scored on the basis of the performance of the selected players in a contest, and those team-level scores were provided in the contest user table

Few possible hypotheses are listed below:

  1. Users playing free contests are scoring lower than users playing paid contests. This can prove that when users play paid contests, they play more cautiously and strategically, and do not select teams at random.
  2. Scores of randomly selected players can be tested against scores of the teams based on a specific strategy such as selecting players who have performed well in the recent matches.
  3. Is the selection of captains and vice captains of the team random (equal probability)
  4. Selection of players and winning or getting high scores are dependent on each other.
  5. As the user plays more games, his chance of winning increases (learning effect).

11 Players, 100 credits, Captain 2x, VC 1.5x, Max. players from a Team =7 (C1 … C7)
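The scoring rules in the line above can be made concrete with a tiny sketch (the 11 point values and the captain/vice-captain indices below are invented for illustration):

```r
# #Hypothetical raw points for the 11 picked players (made-up values)
pts <- c(45, 30, 62, 10, 25, 8, 51, 33, 19, 40, 27)
captain <- 3; vice_captain <- 5  # indices into pts (also made up)
# #Captain counts 2x, Vice-Captain 1.5x, everyone else 1x
mult <- rep(1, length(pts))
mult[captain]      <- 2
mult[vice_captain] <- 1.5
team_score <- sum(pts * mult)
team_score
```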

15.3 Data RFM

Please import the "B23-RFM.csv"

15.4 Q1 RFM

As a data scientist, you would like to analyse recency, frequency, and monetary value of an online store. Based on the same, you would like to suggest suitable market segments so that the online store can implement marketing actions efficiently and effectively. In this attempt, use the data (See “B23-RFM.csv”), perform the RFM analysis, and provide practical/managerial recommendations.

  • About: [2823, 25]
    • There are NAs in 4 columns; however, none of these columns is required for RFM, so they are kept as-is.
    • Convert ORDERDATE to Date
  • Conclusion
    • We are losing customers and need to provide incentives for more frequent visits. It is clearly visible that higher frequency has a direct positive correlation with higher monetary purchase.
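As a cross-check of what `rfm_table_order()` computes, the three metrics and their 1-5 scores can be sketched in base R on a made-up transaction table (`cust`, `date`, `sales` are hypothetical stand-ins for CUSTOMERNAME, ORDERDATE, SALES):

```r
set.seed(7)
# #Hypothetical transactions standing in for CUSTOMERNAME / ORDERDATE / SALES
tx <- data.frame(
  cust  = sample(paste0("C", 1:20), 200, replace = TRUE),
  date  = as.Date("2005-01-01") + sample(0:330, 200, replace = TRUE),
  sales = round(runif(200, 50, 500), 2)
)
# #Analysis date = day after the last order, as in the analysis below
analysis_date <- max(tx$date) + 1
# #Per-customer Recency / Frequency / Monetary
rfm <- do.call(rbind, lapply(split(tx, tx$cust), function(d) {
  data.frame(cust      = d$cust[1],
             recency   = as.numeric(analysis_date - max(d$date)),
             frequency = nrow(d),
             monetary  = sum(d$sales))
}))
# #Score 1..5 by quintile of the rank; recency is reversed (recent buyer = 5)
score5 <- function(x) as.integer(cut(rank(x, ties.method = "first"),
                                     breaks = 5, labels = FALSE))
rfm$R <- 6L - score5(rfm$recency)
rfm$F <- score5(rfm$frequency)
rfm$M <- score5(rfm$monetary)
head(rfm[order(-rfm$M), ])
```

The rfm package does the binning on the actual data; this sketch only shows the mechanics of the three scores.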

RFM & Segments

bb <- aa <- xxB23RFM 
# #Convert to Date
bb$ORDERDATE <- dmy(bb$ORDERDATE)
# #Get Analysis Date as the Next Date after the Max Date in the Data
analysis_date <- max(bb$ORDERDATE) + 1 #as_date("2005-12-02")
#
# #RFM analysis by rfm_table_order()
rfm_result <- rfm_table_order(bb, customer_id = CUSTOMERNAME, order_date = ORDERDATE, 
                              revenue = SALES, analysis_date = analysis_date)
# #Output is a Tibble with some other attributes
loc_src <- paste0(.z$XL, "B23-Results-RFM.csv")
# #Save the Result in a CSV
if(FALSE) write_csv(rfm_result$rfm, file = loc_src)
# #Developing segments
segment_titles <- c("First Grade", "Loyal", "Likely to be Loyal", "New Ones", 
                    "Could be Promising", "Require Assistance", "Getting Less Frequent",
                    "Almost Out", "Can not Lose Them", "Do not Show Up at All") 
# #Rules of Minimum and Maximum RFM for each group
r_low  <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1)
r_high <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)
f_low  <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
f_high <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
m_low  <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
m_high <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
#
stopifnot(all(vapply(list(r_low, r_high, f_low, f_high, m_low, m_high), 
                     FUN = function(x) identical(length(x), length(segment_titles)), logical(1))))
divisions <- rfm_segment(rfm_result, segment_names = segment_titles, 
                       recency_lower = r_low, recency_upper = r_high, 
                       frequency_lower = f_low, frequency_upper = f_high, 
                       monetary_lower = m_low, monetary_upper = m_high)
# #Output is a Tibble 
# #Save the Result in a CSV
loc_src <- paste0(.z$XL, "B23-Results-Divisions.csv")
if(FALSE) write_csv(divisions, file = loc_src)
#
# #We defined 10 segments; however, only 7 (+1) of them are represented in the data,
# #and 1 customer was not captured by our classifications and was assigned to 'Others'
divisions %>% 
  count(segment) %>% 
  mutate(PCT = round(100 * n / sum(n), 1)) %>% 
  rename(SEGMENT = segment, FREQ = n) %>% 
  arrange(desc(FREQ)) 
## # A tibble: 8 x 3
##   SEGMENT                FREQ   PCT
##   <chr>                 <int> <dbl>
## 1 First Grade              22  23.9
## 2 Likely to be Loyal       21  22.8
## 3 Loyal                    20  21.7
## 4 Almost Out               10  10.9
## 5 Do not Show Up at All     7   7.6
## 6 Getting Less Frequent     7   7.6
## 7 Require Assistance        4   4.3
## 8 Others                    1   1.1
#

Plots Not Plotted

if(FALSE) {#Histograms of Median RFM for each Segment
  hh <- divisions
  rfm_plot_median_recency(hh)
  rfm_plot_median_frequency(hh)
  rfm_plot_median_monetary(hh)
}
if(FALSE) {
  hh <- rfm_result
  rfm_histograms(hh) #Histograms of RFM
  rfm_order_dist(hh) #Histograms of Customer Orders i.e. Frequency
  rfm_heatmap(hh)    #Heatmap of Monetary on Axes of Recency and Frequency. Slightly Useful
  rfm_bar_chart(hh)  #Bar Charts with Facetting of RFM
  # #Scatter Plots among Recency, Monetary, Frequency
  rfm_rm_plot(hh)
  rfm_fm_plot(hh)
  rfm_rf_plot(hh)
}

NA

colSums(is.na(aa)) %>% as_tibble(rownames = "Cols") %>% filter(value > 0)
## # A tibble: 4 x 2
##   Cols         value
##   <chr>        <dbl>
## 1 ADDRESSLINE2  2521
## 2 STATE         1486
## 3 POSTALCODE      76
## 4 TERRITORY     1074

15.5 Q2 Sports P1

Dream 11 platform has both free and paid users, that is, users who play games for free with no return and users who pay a fee and obtain returns at the end of the game based on their relative performance. Can the average scores of paid and free users help Dream 11 in testing for a skill-based game? (See Sheet “Qns_3_2SampleTTest” of “B23-FantasySports.xlsx”)

32.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \quad \text{vs.} \quad {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)

  • (1: Free) \({n}_1 = 8288, {\overline{x}}_1 = 289.5, {\sigma}_1 = 91.6\)
  • (2: Paid) \({n}_2 = 5180, {\overline{x}}_2 = 301.2, {\sigma}_2 = 74.9\)
  • Compare with \({\alpha} = 0.05\)
    • \(\{{}^L\!P_{(t = -8.0797)} = 0 \} < {\alpha} \to {H_0}\) is rejected. Alternative is accepted.
  • Conclusion
    • It is a skill-based game because the performance of Free users is lower than that of Paid users.
  • Question: ContestType was ignored in the above analysis. However, the Means are different between public and private in the case of Free, and are the same in the case of Paid. How should we proceed
    • “ForLater”
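The reported t statistic can be reproduced from the summary statistics alone, which makes the Welch formula explicit. A base-R sketch; the N/Mean/SD values are the rounded ones from the sample-information table below, so the result matches t = -8.0797 only approximately:

```r
# #Welch two-sample t statistic from summary statistics (rounded values)
n1 <- 8288; m1 <- 289.5; s1 <- 91.6   # Free
n2 <- 5180; m2 <- 301.2; s2 <- 74.9   # Paid
# #Standard error of the difference (unequal variances)
se <- sqrt(s1^2 / n1 + s2^2 / n2)
t_stat <- (m1 - m2) / se
# #Welch-Satterthwaite degrees of freedom
df <- se^4 / ((s1^2 / n1)^2 / (n1 - 1) + (s2^2 / n2)^2 / (n2 - 1))
c(t = round(t_stat, 3), df = round(df))
```

Both agree with the `t.test()` output (t close to -8.08, df = 12539) up to the rounding of the inputs.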

Free vs. Paid

# #Data
free <- xxB23Sports_Q3_2T_Free$userpoints
paid <- xxB23Sports_Q3_2T_Paid$userpoints
#
# #Sample Information
round(vapply(f_namedList(free, paid), 
             FUN = function(x) {c(N = length(x), Mean = mean(x), SD = sd(x))}, 
             FUN.VALUE = numeric(3)), 1)
##        free   paid
## N    8288.0 5180.0
## Mean  289.5  301.2
## SD     91.6   74.9
#
# #Welch Two Sample t-test
ha_bb <- "less" #"two.sided" (Default), "less", "greater"
testT_bb <- t.test(x = free, y = paid, alternative = ha_bb)
testT_bb
## 
##  Welch Two Sample t-test
## 
## data:  free and paid
## t = -8.0797, df = 12539, p-value = 0.0000000000000003543
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -9.313015
## sample estimates:
## mean of x mean of y 
##  289.5123  301.2061
## p-value (0) is less than alpha (0.05).
## We can reject the H0 with 95% confidence. The populations are different.

Free: Public vs. Private

# #Contest Type Comparison for Free
bb <- xxB23Sports_Q3_2T_Free %>% 
  select(key = 2, value = 3) %>% 
  filter(key == "public" | key == "private") %>%
  mutate(across(key, factor))
#
# #Sample Information
bb %>% group_by(key) %>% summarise(N = n(), Mean = round(mean(value), 1), SD = round(sd(value), 1))
## # A tibble: 2 x 4
##   key         N  Mean    SD
##   <fct>   <int> <dbl> <dbl>
## 1 private   566  310.  89.5
## 2 public   7710  288   91.6
#
# #Welch Two Sample t-test
ha_bb <- "two.sided" #"two.sided" (Default), "less", "greater"
testT_bb <- t.test(formula = value ~ key, data = bb, alternative = ha_bb)
testT_bb
## 
##  Welch Two Sample t-test
## 
## data:  value by key
## t = 5.6664, df = 654.91, p-value = 0.00000002186
## alternative hypothesis: true difference in means between group private and group public is not equal to 0
## 95 percent confidence interval:
##  14.45501 29.78603
## sample estimates:
## mean in group private  mean in group public 
##              310.0967              287.9762
## p-value (0) is less than alpha (0.05).
## We can reject the H0 with 95% confidence. The populations are different.

Outcome

# #Compare p-value with alpha = 0.05
# #NOTE: t.test() already computes the p-value for the chosen alternative
# #(doubling the tail for "two.sided"), so it is compared with alpha directly
alpha <- 0.05
if(testT_bb$p.value >= alpha) {
  cat(paste0("p-value (", round(testT_bb$p.value, 6), ") is greater than alpha (", alpha, 
      "). We failed to reject H0. We cannot conclude that the populations are different.\n")) 
} else {
    cat(paste0("p-value (", round(testT_bb$p.value, 6), ") is less than alpha (", alpha, 
      ").\nWe can reject the H0 with 95% confidence. The populations are different.\n"))
}

15.6 Q2 Sports P2

Scores of users who use some strategy to select players (such as the recent performance of players) are higher than those of users who select players randomly. (See Sheet “Qns_4_2SampleTTest” of “B23-FantasySports.xlsx”)

32.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \quad \text{vs.} \quad {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)

  • (1: Random) \({n}_1 = 36, {\overline{x}}_1 = 249.2, {\sigma}_1 = 55.1\)
  • (2: Strategy) \({n}_2 = 36, {\overline{x}}_2 = 372.8, {\sigma}_2 = 42.2\)
  • Compare with \({\alpha} = 0.05\)
    • \(\{{}^L\!P_{(t = -10.69)} = 0 \} < {\alpha} \to {H_0}\) is rejected. Alternative is accepted.
  • Conclusion
    • It is a skill-based game because the performance of users who select randomly is lower than that of users who follow some strategy.
# #Team Type Comparison with correction of Typo
bb <- xxB23Sports_Q4_2T %>% 
  select(key = "TeamType", value = "totalpoints") %>% 
  mutate(across(key, str_replace, "Stratergy", "Strategy")) %>% 
  mutate(across(key, factor)) 
#
# #Sample Information
bb %>% group_by(key) %>% summarise(N = n(), Mean = round(mean(value), 1), SD = round(sd(value), 1))
## # A tibble: 2 x 4
##   key          N  Mean    SD
##   <fct>    <int> <dbl> <dbl>
## 1 Random      36  249.  55.1
## 2 Strategy    36  373.  42.2
#
# #Welch Two Sample t-test
ha_bb <- "less" #"two.sided" (Default), "less", "greater"
testT_bb <- t.test(formula = value ~ key, data = bb, alternative = ha_bb)
testT_bb
## 
##  Welch Two Sample t-test
## 
## data:  value by key
## t = -10.69, df = 65.563, p-value = 0.000000000000000263
## alternative hypothesis: true difference in means between group Random and group Strategy is less than 0
## 95 percent confidence interval:
##       -Inf -104.2835
## sample estimates:
##   mean in group Random mean in group Strategy 
##               249.1806               372.7500
## p-value (0) is less than alpha (0.05).
## We can reject the H0 with 95% confidence. The populations are different.

15.7 Q2 Sports P3

If fantasy sports is a game of skill, then player performance has a major role in the player getting selected to a team as well as in the selection of captain or vice-captain. Using the data, can we test whether the selection of players in a team and getting high scores are dependent on each other? (See Sheet “Qns_5_2SampleTTest” of “B23-FantasySports.xlsx”)

  • User Category (Top Quartile or Not in Top Quartile) is Categorical
  • Player Selected (Yes or No) is Categorical
  • Since both the variables are categorical, we would need to perform Chi-square test
  • \(P_{\chi^2} < {\alpha} \to {H_0}\) is rejected. Alternative is accepted.
    • Population Proportions are different.
    • The sample results provide sufficient evidence that ‘selection of a player in the team’ and ‘user high scores’ are dependent on each other.
# #Select | Sum | Long | Separate String | Wide | Relocate | Column To RowNames |
bb <- xxB23Sports_Q5_Chi_Player %>% 
  select(nTop_nSelect = 3, Top_nSelect = 4, nTop_Select = 5, Top_Select = 6) %>% 
  summarise(across(everything(), sum)) %>% 
  pivot_longer(everything()) %>% 
  separate(name, c("isTop", "isSelect")) %>% 
  pivot_wider(names_from = isSelect, values_from = value) %>% 
  relocate(nSelect, .after = last_col()) %>% 
  column_to_rownames('isTop')
bb
##       Select nSelect
## nTop 1350174  109648
## Top   447240   16594
# #Chi-squared Test
chisq.test(bb)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  bb
## X-squared = 8881, df = 1, p-value < 0.00000000000000022
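The statistic can also be built by hand from the aggregated 2×2 table, which makes the expected-count logic explicit. A base-R sketch using the counts shown above; for 2×2 tables `chisq.test()` applies Yates' continuity correction, which is reproduced here:

```r
# #Observed 2x2 table: rows = user category, cols = player selection
obs <- matrix(c(1350174, 109648,
                 447240,  16594), nrow = 2, byrow = TRUE,
              dimnames = list(c("nTop", "Top"), c("Select", "nSelect")))
# #Expected counts under independence: row total * column total / grand total
exp_cnt <- outer(rowSums(obs), colSums(obs)) / sum(obs)
# #Pearson statistic with Yates' continuity correction (as chisq.test() reports)
x2 <- sum((abs(obs - exp_cnt) - 0.5)^2 / exp_cnt)
x2
pchisq(x2, df = 1, lower.tail = FALSE)  # effectively zero
```

The hand-built statistic matches the X-squared = 8881 reported above.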

15.8 Q2 Sports P4

In the data supplied, a few users have selected one of the top three high performing players as captain or vice-captain. There are also users who have not used any of these players as captain or vice-captain. Ramsu claims that choosing high performing players as captain and/or vice-captain has an impact on the scores of the users. Test this claim made by Ramsu and link it to the business problem to make an inference. (See Sheet “Qns_7_Anova” of “B23-FantasySports.xlsx”)

  • ANOVA Conclusion:
    • Choosing high performing players as captain and/or vice-captain has an impact on the scores of the users.
  • Question: Coefficients of ANOVA show only “NoSelect” and “ViceCap.” “Captain” is missing. What does it mean
    • “ForLater”
# #NOTE: Because of the high number of rows, data was exported to CSV and then imported
NoSelect <- xxB23Sports_Q7_Anova_NotSelect %>% drop_na(userpoints) %>% select(userpoints)
Captain <- xxB23Sports_Q7_Anova_Captain %>% drop_na(userpoints) %>% select(userpoints)
ViceCap <- xxB23Sports_Q7_Anova_VC %>% drop_na(userpoints) %>% select(userpoints)
#
# #Merge Datasets by Rows
q2p4 <- bind_rows(NoSelect = NoSelect, Captain = Captain, ViceCap = ViceCap, .id = 'Type') 
# ANOVA
anv_q2p4 <- aov(userpoints ~ Type, data = q2p4)
anv_q2p4
## Call:
##    aov(formula = userpoints ~ Type, data = q2p4)
## 
## Terms:
##                      Type Residuals
## Sum of Squares   18812927 747815344
## Deg. of Freedom         2    431507
## 
## Residual standard error: 41.6297
## Estimated effects may be unbalanced
#
summary(anv_q2p4)
##                 Df    Sum Sq Mean Sq F value              Pr(>F)    
## Type             2  18812927 9406463    5428 <0.0000000000000002 ***
## Residuals   431507 747815344    1733                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# #Coefficients
anv_q2p4$coefficients
##  (Intercept) TypeNoSelect  TypeViceCap 
##   297.684103   -14.373297    -3.556425
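
An aside on the “missing” Captain coefficient: with R’s default treatment contrasts, the first factor level (alphabetically “Captain” here) is absorbed into the intercept, and the other levels are reported as differences from it. A minimal sketch with made-up numbers illustrating the same pattern:

```r
# Toy one-way ANOVA with the same three group labels (hypothetical values)
d <- data.frame(
  Type = rep(c("Captain", "NoSelect", "ViceCap"), each = 3),
  y    = c(10, 11, 12, 4, 5, 6, 7, 8, 9)
)
fit <- aov(y ~ Type, data = d)
coef(fit)
# (Intercept)  = mean of the reference group "Captain" (11)
# TypeNoSelect = mean(NoSelect) - mean(Captain) = 5 - 11 = -6
# TypeViceCap  = mean(ViceCap)  - mean(Captain) = 8 - 11 = -3
```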

15.9 Q3 Champo

Discuss clustering algorithms that can be used for segmenting customers of Champo Carpets. Apply both k-means and hierarchical clustering techniques and provide insights on the segments we can extract out of these data.

Note: Use Sheet “Data Order ONLY” of “B23-Champo.xlsx.” It has 13,135 records. Use only the numerical variables (e.g., quantity required, total area, and amount) for performing cluster analysis.

  • k-means : 3 clusters look optimal, or should we go with 6?
    • Outliers are present, and 6 looks like the better number of clusters
  • Hierarchical : Dendrogram was not plotted because of noise
  • Conclusion
    • Country USA and Customer TGT have a large share of revenue and these should be the focus areas.
    • Orders: USA 86% UK 6% by Revenue
    • CustomerCode: “TGT 36%, H-2 12%”
    • Notes:
      • There are Orders having 0 Amount, i.e. NO Revenue!
      • 6 Rows are 100% duplicated and some are almost duplicated, i.e. DesignName and ColourName changed but no impact on Amount
      • There is a 1-1 Relationship between Country and Customer, i.e. a Customer belongs to one specific country only
xw <- xxB23Champo_3_Order %>% select(Quantity = QtyRequired, Area = TotalArea, Amount)
zw <- xw %>% mutate(across(everything(), ~ as.vector(scale(.))))
str(xw)
## tibble [13,135 x 3] (S3: tbl_df/tbl/data.frame)
##  $ Quantity: num [1:13135] 6 6 7 7 5 6 35 5 4 7 ...
##  $ Area    : num [1:13135] 128 117 88 88 117 ...
##  $ Amount  : num [1:13135] 770 702 616 616 585 ...
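
A note on why zw standardises the columns first: k-means and dist() use Euclidean distance, so Amount (in the hundreds or thousands) would otherwise dominate Quantity (single digits). A tiny sketch using values taken from the first rows shown above:

```r
# scale() centres each column to mean 0 and rescales to sd 1
x <- data.frame(Quantity = c(5, 6, 7, 35), Amount = c(585, 616, 702, 770))
z <- as.data.frame(lapply(x, function(col) as.vector(scale(col))))
round(colMeans(z), 10)  # both 0
sapply(z, sd)           # both 1
```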
# #This is Slow.
hh <- zw
cap_hh <- "B23P01"
ttl_hh <- "Champo: Elbow Curve (WSS)"
#
# #factoextra::fviz_nbclust() generates ggplot
# #method = "wss" (for total within sum of square)
B23P01 <- fviz_nbclust(hh, FUNcluster = kmeans, method = "wss") +
  labs(caption = cap_hh, title = ttl_hh)
hh <- zw
cap_hh <- "B23P02"
ttl_hh <- "Champo: Elbow Curve (Silhouette)"
#
# #method = "silhouette" (for average silhouette width)
B23P02 <- fviz_nbclust(hh, FUNcluster = kmeans, method = "silhouette") +
  labs(caption = cap_hh, title = ttl_hh)

WSS vs. Silhouette

(B23P01 B23P02) Champo: WSS and Silhouette

Figure 15.1 (B23P01 B23P02) Champo: WSS and Silhouette

k-means

# #Fix Seed
set.seed(3)
# #Cluster analysis with k = {3, 6}
k3_zw <- kmeans(zw, centers = 3)
k6_zw <- kmeans(zw, centers = 6)
#
# #Save cluster membership of each point back into the dataset
res_champo <- cbind(xw, 
  list(k3 = k3_zw$cluster, k6 = k6_zw$cluster)) %>% as_tibble()
# #Three Clusters
ii <- k3_zw
ii$size
## [1] 9038 4043   54
paste0("Between /Total = ",  round(100 * ii$betweenss / ii$totss, 2), "%")
## [1] "Between /Total = 51.52%"
round(ii$centers, 3)
##   Quantity   Area Amount
## 1    0.012 -0.566 -0.063
## 2   -0.138  1.262 -0.012
## 3    8.366  0.229 11.447
# #Six Clusters
ii <- k6_zw
ii$size
## [1] 6767 2862 1052   29   28 2397
paste0("Between /Total = ",  round(100 * ii$betweenss / ii$totss, 2), "%")
## [1] "Between /Total = 79.95%"
round(ii$centers, 3)
##   Quantity   Area Amount
## 1    0.038 -0.735 -0.074
## 2   -0.124  0.920  0.007
## 3   -0.174  2.301 -0.036
## 4   -0.095  0.897 18.960
## 5   18.077 -0.651  1.816
## 6   -0.093 -0.035 -0.033
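
Since the centres above are in standardised units, they can be read in the original units by inverting the scaling (original = mean + z * sd). A toy check:

```r
# Round-trip: standardise a vector, then map the z-scores back
x <- c(5, 6, 7, 35)
z <- as.vector(scale(x))
x_back <- mean(x) + z * sd(x)
all.equal(x, x_back)  # TRUE
```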

Plot k=3

(B23P03) Champo: k-means with k=3

Figure 15.2 (B23P03) Champo: k-means with k=3

Plot k=6

(B23P04) Champo: k-means with k=6

Figure 15.3 (B23P04) Champo: k-means with k=6

Hierarchical

str(zw)
## tibble [13,135 x 3] (S3: tbl_df/tbl/data.frame)
##  $ Quantity: num [1:13135] -0.168 -0.168 -0.164 -0.164 -0.173 ...
##  $ Area    : num [1:13135] 1.667 1.442 0.864 0.864 1.442 ...
##  $ Amount  : num [1:13135] -0.0964 -0.1004 -0.1055 -0.1055 -0.1074 ...
#
# #Create distance matrix
dist_zw <- dist(zw)
#
hclust_com_zw <- hclust(dist_zw, method = "complete")
#hclust_avg_zw <- hclust(dist_zw, method = "average")
#hclust_sng_zw <- hclust(dist_zw, method = "single")
#
# #Cut Tree by Cluster membership
k3_com_zw <- cutree(hclust_com_zw, 3)
k4_com_zw <- cutree(hclust_com_zw, 4)
k6_com_zw <- cutree(hclust_com_zw, 6)
#
# #Save cluster membership of each point back into the dataset
hrc_champo <- cbind(xw, list(k3 = k3_com_zw, k4 = k4_com_zw, k6 = k6_com_zw)) %>% as_tibble()
#
# #Cluster Mean
if(FALSE) aggregate(zw, by = list(k3_com_zw), FUN = function(x) round(mean(x), 3))
# #Equivalent
hrc_champo %>% select(-c(k4, k6)) %>% group_by(k3) %>% 
  summarise(N = n(), across(everything(), mean))
## # A tibble: 3 x 5
##      k3     N Quantity  Area  Amount
##   <int> <dbl>    <dbl> <dbl>   <dbl>
## 1     1 13075     35.5  44.7   1592.
## 2     2    31     22.5  88.4 310634.
## 3     3    29   4097.   12.9  33562.
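
The hclust() -> cutree() workflow used above can be sketched on a toy set of two well-separated groups, showing that cutree() returns a cluster-membership vector:

```r
# Two tight groups far apart; complete linkage separates them cleanly
pts <- data.frame(x = c(0, 0.1, 0.2, 10, 10.1, 10.2))
hc  <- hclust(dist(pts), method = "complete")
k2  <- cutree(hc, k = 2)
k2
# First three points fall in one cluster, last three in the other
```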

Validation


16 Association Rule (B24, Dec-19)

16.1 Overview

  • “Association Rule Mining”
  • Import Data Makeup - B22
  • Discussion on R Markdown / R Notebook from 16:45 - 17:20 has not been included.
  • “ForLater” Professor will cover “Attribution Model” later.
  • “ForLater” Book for Ridge, Lasso and Elastic Regression
xxB22Makeup <- f_getRDS(xxB22Makeup)
bb <- aa <- xxB22Makeup
#
xw <- aa %>% mutate(across(everything(), factor, levels = c("No", "Yes")))
#str(xw)
dim(xw)
## [1] 1000   14
summary(xw)
##   Bag      Blush     Nail Polish Brushes   Concealer Eyebrow Pencils Bronzer   Lip liner Mascara  
##  No :946   No :637   No :720     No :851   No :558   No :958         No :721   No :766   No :643  
##  Yes: 54   Yes:363   Yes:280     Yes:149   Yes:442   Yes: 42         Yes:279   Yes:234   Yes:357  
##  Eye shadow Foundation Lip Gloss Lipstick  Eyeliner 
##  No :619    No :464    No :510   No :678   No :543  
##  Yes:381    Yes:536    Yes:490   Yes:322   Yes:457

16.2 Packages

if(FALSE){# #WARNING: Installation may take some time.
  install.packages("arules", dependencies = TRUE)
  install.packages("arulesViz", dependencies = TRUE)
}

16.3 Redundant Rules

  • Question: What is the meaning of maxlen = 10? Will there be 10 rules or 10 items?
    • Default is 10. We will look at combinations of up to 10 items
    • (Aside) The algorithm will search up to a set of maximum 10 unique items, i.e. for maxlen = 2, rules will be considered up to item pairs {AB, AC, BC}, but for maxlen = 3, triplets will also be considered, i.e. {ABC} and so on
  • Question: What is the meaning of minlen = 2?
    • Number of items in the Basket
    • (Aside) With minlen = 1, trivial rules with an empty antecedent, e.g. {} => {Potato} (“regardless of the basket, people bought Potato”), would also be generated. minlen = 2 ensures that at least two unique items are present in the combined set of Antecedent and Consequent.
  • is.redundant()
    • It returns a logical vector
    • Rule \(\{A, B\} \Rightarrow \{Z\}\) is a more general rule compared to the specific rule \(\{A, B, C\} \Rightarrow \{Z\}\). In other words, the specific rule’s itemset is a superset of the general rule’s.
    • If the general rule has higher confidence than the specific rule, we do not need to look at the specific rule. Specific rule is redundant.
  • Question: What happens if the specific rule \(\{A, B, C\} \Rightarrow \{Z\}\) has higher confidence than general rule \(\{A, B\} \Rightarrow \{Z\}\)
    • Then we cannot delete
    • Refer Table 16.1
      • SN = 1 is Specific {Lip Gloss=Yes, Lipstick=Yes} whereas SN = 2 is General {Lip Gloss=Yes}. Yet, the specific rule is not categorised as redundant because it has higher confidence (0.734) than that of general rule (0.727)
Definition 16.1 A rule can be defined as redundant if a more general rule with the same or a higher confidence exists.

48.7 a priori property: If an itemset Z is not frequent, then for any item A, \(Z \cup A\) will not be frequent. In fact, no superset of Z (itemset containing Z) will be frequent.

  • Question: What happens if we put minlen = 0, or (minlen = 2 and maxlen = 2) because triplets are causing a lot of redundancy?
    • There is no meaning of a set with 0 items
    • We can filter out the rules which are redundant
    • (Aside) Rather than the outright exclusion of sets with 3 or more items, it is better to generate more rules and then filter based on the criteria.
  • Question: What is the meaning of Lift < 1?
    • It means the antecedent and consequent occur together less often than expected if they were independent, so that rule should not be considered.
    • You can also locate items based on your own margin and other considerations.
    • Seasonal data would be different. Store-wise data would be different.
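
The measures in Table 16.1 can be reproduced by hand. Using the rule {Lip Gloss=Yes} => {Foundation=Yes} with the counts from the summary above (490 baskets with Lip Gloss, 536 with Foundation) and the rule’s count of 356, out of n = 1000:

```r
# Support, coverage, confidence and lift from raw counts
n <- 1000; n_lhs <- 490; n_rhs <- 536; n_both <- 356

support    <- n_both / n                 # 0.356: P(LHS and RHS)
coverage   <- n_lhs  / n                 # 0.490: support of the LHS alone
confidence <- n_both / n_lhs             # 0.727: P(RHS | LHS)
lift       <- confidence / (n_rhs / n)   # 1.355: > 1 means LHS raises P(RHS)

round(c(support, coverage, confidence, lift), 3)
```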

Yes LHS Only

# #RHS: "Foundation=Yes" #xxx1
# #LHS: All Yes Only 
rr_sup <- 0.1
rr_conf <- 0.5
rr_rhs <- 11L #index of "Foundation"
rules <- suppressWarnings(apriori(xw, 
  parameter = list(minlen = 2, maxlen = 3, support = rr_sup, confidence = rr_conf),
  appearance = list(rhs = paste0(names(xw)[rr_rhs], "=", levels(xw[[rr_rhs]])[2]), 
                    lhs = paste0(names(xw)[-rr_rhs], "=", levels(xw[[rr_rhs]])[2]), 
                    default = "none")))
# #Check Redundancy of Rules based on measure = {"confidence", "oddsRatio", "lift"}
# #Note these are column names in rules; if they do not exist, then create them
isRedConf <- is.redundant(rules, measure = "confidence")
isRedLift <- is.redundant(rules, measure = "lift")
#which(isRedConf)
hh <- rules
Table 16.1: (B24T01) Support = 0.1 & Confidence = 0.5 Excluding Redundant (Lift) & Sorted (Lift) gives Rules = 10
SN LHS_Antecedent RHS_Consequent Support Confidence Coverage Lift Count
1 {Lip Gloss=Yes,Lipstick=Yes} {Foundation=Yes} 0.116 0.734 0.158 1.37 116
2 {Lip Gloss=Yes} {Foundation=Yes} 0.356 0.727 0.49 1.355 356
3 {Eye shadow=Yes} {Foundation=Yes} 0.211 0.554 0.381 1.033 211
4 {Blush=Yes,Mascara=Yes} {Foundation=Yes} 0.101 0.549 0.184 1.024 101
5 {Mascara=Yes} {Foundation=Yes} 0.192 0.538 0.357 1.003 192
6 {Blush=Yes} {Foundation=Yes} 0.192 0.529 0.363 0.987 192
7 {Concealer=Yes} {Foundation=Yes} 0.231 0.523 0.442 0.975 231
8 {Eyeliner=Yes} {Foundation=Yes} 0.238 0.521 0.457 0.972 238
9 {Lipstick=Yes} {Foundation=Yes} 0.167 0.519 0.322 0.968 167
10 {Nail Polish=Yes} {Foundation=Yes} 0.143 0.511 0.28 0.953 143

Prune

# #Tibble excluding Redundant Rules #IN: rules Out: hh, pruned_tbl #xxx2
# #Rules | DataFrame | Tibble | Rename | TitleCase | Add Columns | Filter | Drop | SN | Relocate |
pruned_tbl <- DATAFRAME(rules) %>% as_tibble() %>% 
  rename(LHS_Antecedent = LHS, RHS_Consequent = RHS) %>% 
  rename_with(str_to_title, .cols = where(is.numeric)) %>% 
  mutate(isRedConf = isRedConf, isRedLift = isRedLift) %>% 
  filter(!isRedLift) %>% 
  select(-c(isRedConf, isRedLift)) %>% 
  arrange(desc(Lift)) %>% 
  mutate(SN = row_number()) %>% 
  relocate(SN) 
#
hh <- pruned_tbl
# #Sort and Subset are available for Rules (Similar to above) #xxx4
ii <- rules[!isRedLift]
pruned <- sort(ii, by = "lift")
inspect(pruned)
##      lhs                              rhs              support confidence coverage lift      count
## [1]  {Lip Gloss=Yes, Lipstick=Yes} => {Foundation=Yes} 0.116   0.7341772  0.158    1.3697336 116  
## [2]  {Lip Gloss=Yes}               => {Foundation=Yes} 0.356   0.7265306  0.490    1.3554676 356  
## [3]  {Eye shadow=Yes}              => {Foundation=Yes} 0.211   0.5538058  0.381    1.0332197 211  
## [4]  {Blush=Yes, Mascara=Yes}      => {Foundation=Yes} 0.101   0.5489130  0.184    1.0240915 101  
## [5]  {Mascara=Yes}                 => {Foundation=Yes} 0.192   0.5378151  0.357    1.0033864 192  
## [6]  {Blush=Yes}                   => {Foundation=Yes} 0.192   0.5289256  0.363    0.9868015 192  
## [7]  {Concealer=Yes}               => {Foundation=Yes} 0.231   0.5226244  0.442    0.9750456 231  
## [8]  {Eyeliner=Yes}                => {Foundation=Yes} 0.238   0.5207877  0.457    0.9716189 238  
## [9]  {Lipstick=Yes}                => {Foundation=Yes} 0.167   0.5186335  0.322    0.9675999 167  
## [10] {Nail Polish=Yes}             => {Foundation=Yes} 0.143   0.5107143  0.280    0.9528252 143
#
# #Quality of Rules (data frame of numerics i.e. Support, Confidence, Coverage, Lift, Count)
# #However it drops the rule labels, so it is of limited use on its own
quality(pruned)
##    support confidence coverage      lift count
## 9    0.116  0.7341772    0.158 1.3697336   116
## 7    0.356  0.7265306    0.490 1.3554676   356
## 5    0.211  0.5538058    0.381 1.0332197   211
## 10   0.101  0.5489130    0.184 1.0240915   101
## 4    0.192  0.5378151    0.357 1.0033864   192
## 3    0.192  0.5289256    0.363 0.9868015   192
## 8    0.231  0.5226244    0.442 0.9750456   231
## 6    0.238  0.5207877    0.457 0.9716189   238
## 1    0.167  0.5186335    0.322 0.9675999   167
## 2    0.143  0.5107143    0.280 0.9528252   143

16.4 Support vs Confidence

ERROR 16.1 Error in ... : plot.new has not been called yet
  • If the error occurs while using plot() or if the resultant png is blank, it might be a ggplot object rather than Base R.
  • Try to use ggplot commands including ggsave() on it.
  • OR if it is actually a base R plot, add plot.new() before calling plot()
ERROR 16.2 Error in plot(...) : non-numeric argument to binary operator
  • If the error occurs while using ggplot2 syntax, the object might be Base R plot.

ScatterPlot

(B24P01) Makeup: Support and Confidence with Lift as Gradient

Figure 16.1 (B24P01) Makeup: Support and Confidence with Lift as Gradient

Code

if(FALSE) pruned_tbl
hh <- pruned_tbl
#
ttl_hh <- "Makeup: Rules Excluding Redundant"
cap_hh <- "B24P01"
sub_hh <- "Note: To stretch the difference, X-axis is only upto 0.5 not 1"
#
if(FALSE) { #To Remove the lowest contrast white from the Palette. This Works.
  #Drop Lowest Two Colours in the Palette (which has Max 9 Colours) and get a continuous interpolation function
  k_Palette_B24_O <- colorRampPalette(brewer.pal(9, "Oranges")[-c(1:2)]) 
  # #Use the Function to Rescale the gradients
  k_Scale_B24_O <- scale_colour_gradientn(colours = k_Palette_B24_O(10), 
                                          limits=c(min(hh$Lift), max(hh$Lift)))
}
#
B24 <- hh %>% { 
  ggplot(., aes(x = Support, y = Confidence, colour = Lift, label = LHS_Antecedent)) +
  geom_point() +
  geom_text_repel(max.overlaps = 20) +
  scale_colour_viridis_c(direction = -1) +
  #scale_colour_distiller(palette = "Oranges", direction = 1) +
  #k_Scale_B24_O + 
  scale_y_continuous(breaks = breaks_pretty(), limits = c(0, 1)) + 
  scale_x_continuous(breaks = breaks_pretty(), limits = c(0, 0.5)) + 
  #coord_fixed() +
  theme(plot.title.position = "panel", legend.position = c(0.9, 0.3)) +
  labs(subtitle = sub_hh, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B24)
rm(B24)

16.5 Data Basket

Please import the "B24-Basket.csv"

  • Question: What about other types of data, like order-wise or customer-wise?
    • We will look at that also.
  • arules::read.transactions()
    • For reading Transaction Data
  • Warning:
    • “In asMethod(object) : removing duplicated items in transactions”
    • The problem is not with duplicated transactions (the same row appearing twice) but duplicated items (the same item appearing more than once in the same transaction)
    • Add rm.duplicates = TRUE to remove these duplicates
    • Question: What if you want to show that double the normal amount has been bought in a transaction, by listing the item twice in the same transaction?
      • Then it is not the kind of information that ‘apriori’ handles
      • Further, ‘arules’ require transactions without duplicated items
        • It stores ‘sparse matrix’ which can store exist/not-exist for each item and cannot store quantity
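
The “sparse matrix stores exist/not-exist” point can be mimicked in base R: each transaction becomes a row of TRUE/FALSE flags, which is why a duplicated item inside one transaction carries no extra information (the item names here are made up):

```r
# Toy transactions, one with a duplicated item
trans <- list(c("milk", "bread"), c("milk", "milk", "butter"), "bread")
items <- sort(unique(unlist(trans)))

# Incidence matrix: rows = transactions, columns = items, cells = present/absent
inc <- t(sapply(trans, function(tr) items %in% tr))
colnames(inc) <- items
inc
# Row 2 is TRUE for "milk" only once, no matter how many times it appeared
```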

EDA

bb <- aa <- xxB24Basket
#
# #It is a sparse matrix
names(attributes(aa))
## [1] "data"        "itemInfo"    "itemsetInfo" "class"
#
str(attributes(aa)$itemInfo$labels)
##  chr [1:167] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "bags" "baking powder" ...
#
names(attributes(summary(aa)))
## [1] "Dim"           "density"       "itemSummary"   "lengths"       "lengthSummary" "itemInfo"     
## [7] "itemsetInfo"   "class"
#
attributes(summary(aa))$Dim
## [1] 14963   167
#
attributes(summary(aa))$itemSummary
##       whole milk other vegetables       rolls/buns             soda           yogurt 
##             2363             1827             1646             1453             1285 
##          (Other) 
##            29432
#
attributes(summary(aa))$lengths
## sizes
##     1     2     3     4     5     6     7     8     9    10 
##   205 10012  2727  1273   338   179   113    96    19     1
#
#summary(aa)

Import Transactions

16.6 Item Frequency

Bar Plot

(B24P02) Basket: Absolute Item Frequency Plot of Items

Figure 16.2 (B24P02) Basket: Absolute Item Frequency Plot of Items

Code

# #Absolute Item Frequency Plot using sparse matrix from arules
hh <- aa 
nn_hh <- 15L
type_hh <-  "absolute" # "relative"
ttl_hh <- paste0("Basket: Absolute Item Frequency Plot of Items for Top N = ", nn_hh)
cap_hh <- "B24P02"
x_hh <- NULL
y_hh <- "Item Frequency (Absolute)"
loc_png <- paste0(.z$PX, "B24P02", "-Basket-Freq-Abs", ".png")
#
if(!file.exists(loc_png)) {
  png(filename = loc_png) 
  #dev.control('enable') 
  itemFrequencyPlot(hh, topN = nn_hh, type = type_hh, 
                    col = viridis(nn_hh), xlab = x_hh, ylab = y_hh, main = ttl_hh)
  title(sub = cap_hh, line = 4, adj = 1)
  B24 <- recordPlot()
  dev.off()
  assign(cap_hh, B24)
  rm(B24)
}

16.7 Rules for Basket

  • This did not provide any good result because Count is 2 for most of the rules, which is of no use; the matches are essentially random.
# #No appearance restriction: all items allowed on LHS and RHS
xw <- aa
rr_sup <- 0.0001 #extremely low value used
rr_conf <- 0.5
#rr_rhs <- 11L #index of "Foundation"
rules <- suppressWarnings(apriori(xw, 
  parameter = list(minlen = 2, maxlen = 3, support = rr_sup, confidence = rr_conf)))
# #Check Redundancy of Rules beased on measure = {"confidence", "oddsRatio", "lift"}
isRedConf <- is.redundant(rules, measure = "confidence")
isRedLift <- is.redundant(rules, measure = "lift")
#
#which(isRedConf)
hh <- rules
Table 16.2: (B24T02) Basket: Not Useful (Count is 2 mostly)
SN LHS_Antecedent RHS_Consequent Support Confidence Coverage Lift Count
1 {frankfurter,vinegar} {softener} 0.000134 0.667 0.0002 243.3 2
2 {frankfurter,softener} {vinegar} 0.000134 0.667 0.0002 195.6 2
3 {frankfurter,liver loaf} {condensed milk} 0.000134 1 0.000134 152.7 2
4 {frozen vegetables,tea} {cat food} 0.000134 1 0.000134 84.5 2
5 {cereals,tropical fruit} {packaged fruit/vegetables} 0.000134 0.667 0.0002 78.5 2
6 {hygiene articles,sliced cheese} {liquor} 0.000134 0.5 0.000267 72.6 2
7 {frozen vegetables,vinegar} {grapes} 0.000134 1 0.000134 69.3 2
8 {bottled beer,salt} {misc. beverages} 0.000134 1 0.000134 63.4 2
9 {bottled beer,turkey} {misc. beverages} 0.000134 1 0.000134 63.4 2
10 {fruit/vegetable juice,specialty cheese} {long life bakery product} 0.000134 1 0.000134 55.8 2

16.8 Data Groceries

Please import the "B24-Groceries.csv"

EDA

bb <- aa <- xxB24Groceries
#
str(attributes(aa)$itemInfo$labels)
##  chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" "bags" ...
#
attributes(summary(aa))$Dim
## [1] 9835  169
#
attributes(summary(aa))$itemSummary
##       whole milk other vegetables       rolls/buns             soda           yogurt 
##             2513             1903             1809             1715             1372 
##          (Other) 
##            34055
#
attributes(summary(aa))$lengths
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46   29   14   14    9 
##   21   22   23   24   26   27   28   29   32 
##   11    4    6    1    1    1    1    3    1
#
#summary(aa)

Import

16.9 Rules for Groceries

# #No appearance restriction: all items allowed on LHS and RHS
xw <- aa
rr_sup <- 0.01
rr_conf <- 0.5
#rr_rhs <- 11L #index of "Foundation"
rules <- suppressWarnings(apriori(xw, 
  parameter = list(minlen = 2, maxlen = 3, support = rr_sup, confidence = rr_conf)))
# #Check Redundancy of Rules based on measure = {"confidence", "oddsRatio", "lift"}
isRedConf <- is.redundant(rules, measure = "confidence")
isRedLift <- is.redundant(rules, measure = "lift")
#
#which(isRedConf)
hh <- rules
Table 16.3: (B24T03) Groceries: Support = 0.01 & Confidence = 0.5 Excluding Redundant (Lift) & Sorted (Lift) gives Rules = 15
SN LHS_Antecedent RHS_Consequent Support Confidence Coverage Lift Count
1 {citrus fruit,root vegetables} {other vegetables} 0.0104 0.586 0.0177 3.03 102
2 {root vegetables,tropical fruit} {other vegetables} 0.0123 0.585 0.021 3.02 121
3 {rolls/buns,root vegetables} {other vegetables} 0.0122 0.502 0.0243 2.59 120
4 {root vegetables,yogurt} {other vegetables} 0.0129 0.5 0.0258 2.58 127
5 {curd,yogurt} {whole milk} 0.0101 0.582 0.0173 2.28 99
6 {butter,other vegetables} {whole milk} 0.0115 0.574 0.02 2.24 113
7 {root vegetables,tropical fruit} {whole milk} 0.012 0.57 0.021 2.23 118
8 {root vegetables,yogurt} {whole milk} 0.0145 0.563 0.0258 2.2 143
9 {domestic eggs,other vegetables} {whole milk} 0.0123 0.553 0.0223 2.16 121
10 {whipped/sour cream,yogurt} {whole milk} 0.0109 0.525 0.0207 2.05 107

ScatterPlot

(B24P03) Groceries: Support vs. Confidence with Lift as Gradient and Two RHS

Figure 16.3 (B24P03) Groceries: Support vs. Confidence with Lift as Gradient and Two RHS

16.10 Regression and Classification Framework

45.1 In unsupervised methods, no target variable is identified as such. Instead, the data mining algorithm searches for patterns and structures among all the variables. The most common unsupervised data mining method is clustering. Ex: Voter Profile.

45.2 Supervised methods are those in which there is a particular prespecified target variable and the algorithm is given many examples where the value of the target variable is provided. This allows the algorithm to learn which values of the target variable are associated with which values of the predictor variables.

  • Data Mining Methods and Definitions
    • Data mining methods may be categorized as either supervised or unsupervised.
    • Most data mining methods are supervised methods.
    • Unsupervised : Clustering, PCA, Factor Analysis, Association Rules, RFM
    • Supervised :
      • Regression (Continuous Target) : Linear Regression, Regularised Regression, Decision trees, Ensemble learning
        • Linear Regression : Ridge, Lasso and Elastic Regression
        • Ensemble learning : Bagging, Boosting (AdaBoost, XGBoost), Random forests
      • Classification (Categorical Target) : Decision trees, Ensemble learning, Logistic Regression, k-nearest neighbor (k-NN), Naive-Bayes
      • Deep Learning : Neural Networks

41.4 In estimation, we approximate the value of a numeric target variable using a set of numeric and/or categorical predictor variables. Methods: Point Estimation, Confidence Interval Estimation, Simple Linear Regression, Correlation, Multiple Regression etc.

41.6 Classification is similar to estimation, however, instead of approximating the value of a numeric target variable, the target variable is categorical.

  • Question: Why so many different types of algorithms
    • We would need to apply all of them one by one on our dataset and would need to analyse which of them provide better performance measure on our specific dataset
  • Question: What is the performance measure
    • We calculate the predictive power /accuracy of various algorithms
    • Example: Cross-validation (Train & Test)
  • Question: AutoML functionality provided by various tools apply various algorithms on the dataset and suggest the best possible Algorithm. Are there any drawbacks or limitations of that
    • Mostly Proprietary, EDA and Pre-processing is difficult, limitations of what they allow
    • R, Python : No restriction, Noone else has access to your data
    • Longterm availability and applicability of past learnings based on your specific datasets
    • Automated R Package: rattle: Easy but many restrictions
    • Kaggle : Many competitions happen based on different algorithms

16.11 Groceries (Visualisations)

set.seed(3)
data(Groceries)
xw <- Groceries
#
rr_sup <- 0.001
rr_conf <- 0.5
rules <- apriori(xw, parameter=list(support = rr_sup, confidence = rr_conf))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
##         0.5    0.1    1 none FALSE            TRUE       5   0.001      1     10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [5668 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
#
# #Number of Rules
attributes(summary(rules))$length
## [1] 5668
#
# #Filter Rules with low confidence score.
subrules <- rules[quality(rules)$confidence > 0.8]
attributes(summary(subrules))$length
## [1] 371
#
# #Top10 for Graph Based visualisation
t10rules <- head(rules, n = 10, by = "lift")
ERROR 16.3 Error: ’plot’ is not an exported object from ’namespace:arulesViz’
  • Package ‘arulesViz’ provides an S3 method for plot(). To avoid ambiguity we could try calling the function explicitly as arulesViz::plot(). However, this is NOT an exported function.
  • So, we can use the ::: operator. For a package ‘pkg’ :
    • pkg::name returns the value of the exported variable ‘name’ in namespace ‘pkg’
    • pkg:::name returns the value of the internal variable ‘name’
    • The package namespace will be loaded if it was not loaded before the call, but the package will not be attached to the search path.
    • i.e. arulesViz:::plot.rules() - Note the internal function name is NOT plot().
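
The same :: vs ::: distinction can be demonstrated with a base package: S3 methods are often internal to a namespace, so they need ::: (or the documented getS3method()) to be accessed directly:

```r
# t.test.default is registered as an S3 method but not exported from 'stats'
f1 <- stats:::t.test.default            # internal variable, needs :::
f2 <- getS3method("t.test", "default")  # documented way to fetch the method
identical(f1, f2)  # TRUE
```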

16.11.1 ScatterPlot

A straightforward visualization of association rules is to use a scatter plot with two interest measures on the axes. We can see that rules with high lift typically have relatively low support. It has been argued that the most interesting rules reside on the support/confidence border.

  • All 5 quality measures are available for scatterplot i.e. support, confidence, coverage, lift, count.
  • Other measures can be added by interestMeasure() and then plotted

Image

(B24P04 B24P05) Groceries: Support, Confidence and Lift

Figure 16.4 (B24P04 B24P05) Groceries: Support, Confidence and Lift

Code

hh <- rules
cap_hh <- "B24P04"
ttl_hh <- "Groceries: Support vs Confidence with Shading by Lift"
#
if(FALSE) {# #Accessing Internal Plotting Method plot.rules()
  B24 <- suppressMessages(
  arulesViz:::plot.rules(hh, measure = c("support", "confidence"), shading = "lift")) +
  labs(subtitle = NULL, caption = cap_hh, title = ttl_hh)
}
if(FALSE){# #Help for different methods 
  plot(hh, method = "paracoord", control = "help")  
}
#
B24 <- suppressMessages(
  plot(hh, measure = c("support", "confidence"), shading = "lift")) +
  labs(subtitle = NULL, caption = cap_hh, title = ttl_hh)
assign(cap_hh, B24)
rm(B24)
#
# #S3 Method Used by arulesViz::plot()
#Registered S3 methods overwritten by 'registry':
#  method               from 
#  print.registry_field proxy
#  print.registry_entry proxy
hh <- rules
cap_hh <- "B24P05"
ttl_hh <- "Groceries: Support vs Lift with Shading by Confidence"
#
B24 <- suppressMessages(
  plot(hh, measure = c("support", "lift"), shading = "confidence")) +
  labs(subtitle = NULL, caption = cap_hh, title = ttl_hh)
assign(cap_hh, B24)
rm(B24)

16.11.2 Two-key Plot

Here support and confidence are used for the x and y-axes and the color of the points is used to indicate “order,” i.e., the number of items contained in the rule (Both LHS and RHS).

It would be better to plot this with all rules.

From the plot it is clear that order and support have a very strong inverse relationship, which is a known fact for association rules.
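
The inverse relation between order and support follows from downward closure: every basket containing {A, B, C} also contains {A, B}, so support can only fall as items are added. A quick check on a toy incidence matrix:

```r
# Rows = baskets, columns = items; support of an itemset = share of baskets containing all its items
inc <- rbind(c(1, 1, 1), c(1, 1, 0), c(1, 0, 0), c(1, 1, 1))
colnames(inc) <- c("A", "B", "C")
sup <- function(items) mean(apply(inc[, items, drop = FALSE] == 1, 1, all))

c(sup("A"), sup(c("A", "B")), sup(c("A", "B", "C")))
# 1.00 0.75 0.50 - monotonically non-increasing as the itemset grows
```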

Image

(B24P06) Groceries: Two-key Plot: Support and Confidence with Items

Figure 16.5 (B24P06) Groceries: Two-key Plot: Support and Confidence with Items

Code

hh <- rules
cap_hh <- "B24P06"
ttl_hh <- "Groceries: Two-key Plot"
#
B24 <- suppressMessages(
  plot(hh, method = "two-key plot")) +
  labs(subtitle = NULL, caption = cap_hh, title = ttl_hh)
assign(cap_hh, B24)
rm(B24)

16.11.3 Matrix Plot

A 2-D matrix is used, and the interest measure is represented by color shading of the squares at the intersections. For this type of visualization the number of rows/columns depends on the number of unique itemsets in the consequent/antecedent in the set of rules.

Image

(B24P07) Groceries: Matrix

Figure 16.6 (B24P07) Groceries: Matrix

Code

hh <- subrules #rules
cap_hh <- "B24P07"
ttl_hh <- "Groceries: Matrix"
#
B24 <- plot(hh, method = "matrix", measure = "lift") +
  labs(subtitle = NULL, caption = cap_hh, title = ttl_hh)
assign(cap_hh, B24)
rm(B24)

16.11.4 Grouped Matrix

Matrix-based visualization is limited in the number of rules it can visualize effectively since large sets of rules typically also have large sets of unique antecedents /consequents. We can enhance matrix-based visualization using grouping of rules via clustering to handle a larger number of rules.

Grouped rules are presented as an aggregate in the matrix and can be explored interactively by zooming into and out of groups.

  • To group the column vectors fast and efficient into k groups we use k-means clustering.
  • The default interest measure used is lift.
  • The idea is that antecedents that are statistically dependent on the same consequents are similar and thus can be grouped together.
    • Compared to other clustering approaches for itemsets, this method enables us to even group antecedents containing substitutes (e.g., butter and margarine) which are rarely purchased together since they will have similar dependence to the same consequents.
  • To visualize the grouped matrix we use a balloon plot with antecedent groups as columns and consequents as rows.
    • The color of the balloons represent the aggregated interest measure in the group with a certain consequent and the size of the balloon shows the aggregated support.
    • The default aggregation function is the median value in the group.
    • The number of antecedents and the most important (frequent) items in the group are displayed as the labels for the columns.
    • Furthermore, the columns and rows in the plot are reordered such that the aggregated interest measure is decreasing from top down and from left to right, placing the most interesting group in the top left corner.
  • The group of most interesting rules according to lift (the default measure) are shown in the top-left corner of the plot.
    • There are 3 rules which contain “Instant food products” and up to 2 other items in the antecedent and the consequent is “hamburger meat.”
    • To increase the number of groups we can change k which defaults to 20.
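The grouping idea above can be sketched in base R: treat each unique antecedent as a column vector of lift values (one entry per consequent) and cluster those columns with kmeans(). The lift matrix below is simulated purely for illustration; arulesViz derives it from the actual rule set.

```r
set.seed(1)
# Simulated lift matrix: rows = consequents, columns = unique antecedents
m <- matrix(runif(5 * 12, min = 0.5, max = 3), nrow = 5,
            dimnames = list(paste0("rhs", 1:5), paste0("lhs", 1:12)))
# Cluster the column vectors: antecedents with a similar lift profile
# across the same consequents end up in the same group
k  <- 3
cl <- kmeans(t(m), centers = k)$cluster
split(colnames(m), cl)  # antecedent groups
```

In plot(..., method = "grouped") this clustering happens internally; k is passed via control = list(k = ...), as in the code chunk below.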


Figure 16.7 (B24P08 B24P09) Groceries: Grouped Matrix with k = 20 (default) and k = 50

Code

hh <- rules #rules #subrules 
cap_hh <- "B24P09"
ttl_hh <- "Groceries: Grouped Matrix with k = 50"
#
B24 <- plot(hh, method = "grouped", control = list(k = 50)) +
  labs(subtitle = NULL, caption = cap_hh, title = ttl_hh)
assign(cap_hh, B24)
rm(B24)

16.11.5 Graph Based

  • Graph-based techniques visualize association rules using vertices and edges, where vertices annotated with item labels represent items, and itemsets or rules are represented as a second set of vertices.
    • Items are connected with itemsets/rules using arrows.
    • For rules arrows pointing from items to rule vertices indicate LHS items and an arrow from a rule to an item indicates the RHS.
    • Interest measures are typically added to the plot by using color or size of the vertices representing the itemsets/rules.
    • Graph-based visualization offers a very clear representation of rules, but it tends to become cluttered easily and is thus viable only for very small sets of rules.
    • For the following plots we select the 10 rules with the highest lift.
    • The following plot represents items and rules as vertices connecting them with directed edges.
    • This representation focuses on how the rules are composed of individual items and shows which rules share items.
    • By default ‘igraph’ is being used by ‘arulesViz.’


Figure 16.8 (B24P10 B24P11) Groceries: Graph of Top 10 Rules by Lift

Code Graph

hh <- t10rules #subrules #rules
cap_hh <- "B24P10"
ttl_hh <- "Groceries: Graph of Top 10 Rules by Lift"
#
B24 <- plot(hh, method = "graph") +
  labs(subtitle = NULL, caption = cap_hh, title = ttl_hh)
assign(cap_hh, B24)
rm(B24)

Code Circle

hh <- t10rules #subrules #rules
cap_hh <- "B24P11"
ttl_hh <- "Groceries: Circle Graph of Top 10 Rules by Lift"
loc_png <- paste0(.z$PX, "B24P11", "-Groceries-Graph-Circle", ".png")
#
if(!file.exists(loc_png)) {
  png(filename = loc_png) 
  #dev.control('enable') 
  plot(hh, method = "graph", engine = "igraph", 
       control = list(layout = igraph::in_circle()), main = NULL) 
  title(main = ttl_hh, line = 2, adj = 0)
  title(sub = cap_hh, line = 4, adj = 1)
  B24 <- recordPlot()
  dev.off()
  assign(cap_hh, B24)
  rm(B24)
}

16.11.6 Parallel coordinates plot

  • Parallel coordinates plots are designed to visualize multidimensional data where each dimension is displayed separately on the x-axis and the y-axis is shared.
    • Each data point is represented by a line connecting the values for each dimension.
    • Items are on the y-axis as nominal values and the x-axis represents the positions in a rule, i.e., first item, second item, etc.
    • Instead of a simple line an arrow is used where the head points to the consequent item.
    • Arrows span only enough positions on the x-axis to represent all the items in the rule, i.e., rules with fewer items have shorter arrows.
    • The width of the arrows represents support and the intensity of the color represents confidence.
    • The number of crossovers can be significantly reduced by reordering the items on the y-axis.


Figure 16.9 (B24P12 B24P13) Groceries: Parallel Coordinates Plots (Default and Reordered)

Code

hh <- t10rules #subrules #rules
cap_hh <- "B24P13"
ttl_hh <- "Groceries: Parallel Coordinates Plot"
loc_png <- paste0(.z$PX, "B24P13", "-Groceries-Paracoord-Reorder", ".png")
#
if(!file.exists(loc_png)) {
  png(filename = loc_png) 
  #dev.control('enable') 
  plot.new()
  plot(hh, method = "paracoord", control = list(reorder = TRUE), main = ttl_hh) 
  title(sub = cap_hh, line = 4, adj = 1)
  B24 <- recordPlot()
  dev.off()
  assign(cap_hh, B24)
  rm(B24)
}

16.11.7 Double Decker plots

  • A double decker plot is a variant of a mosaic plot.
    • A mosaic plot displays a contingency table using tiles on a rectangle created by recursive vertical and horizontal splits.
      • The size of each tile is proportional to the value in the contingency table.
    • Double decker plots use only a single horizontal split to visualize a single association rule.
      • Here the displayed contingency table is computed for a rule by counting the occurrence frequency for each subset of items in the antecedent and consequent from the original data set.
      • The items in the antecedent are used for the vertical splits and the consequent item is used for horizontal highlighting.
  • The area of a block gives the support, and the height of the “yes” blocks is proportional to the confidence for the rule consisting of the antecedent items marked as “yes.”
  • Items that show a significant jump in confidence when changed from “no” to “yes” are interesting.


Figure 16.10 (B24P14) Groceries: Double Decker Plot of Single Rule

Code

# #Double Decker Plots need original dataset also (xw)
# #Select One of the Rule
set.seed(3)
hh <- sample(rules, 1) 
inspect(hh)
#
cap_hh <- "B24P14"
ttl_hh <- "Groceries: Double Decker Plot of Single Rule"
loc_png <- paste0(.z$PX, "B24P14", "-Groceries-Double-Decker", ".png")
#
if(!file.exists(loc_png)) {
  png(filename = loc_png) 
  #dev.control('enable') 
  plot.new()
  plot(hh, method = "doubledecker", data = xw, main = ttl_hh)
  title(sub = cap_hh, line = 0, adj = 1)
  B24 <- recordPlot()
  dev.off()
  assign(cap_hh, B24)
  rm(B24)
}

16.11.8 From Rules to Graph

# #convert rules into a graph with rules as nodes
hh <- associations2igraph(rules)
if(FALSE) plot(hh)
#
# #convert the graph into a tidygraph
if(FALSE) {
  #library("tidygraph")
  as_tbl_graph(hh)
  #Error: `trunc_mat()` was deprecated in tibble 3.1.0. 
  #Printing has moved to the pillar package.
}
# #convert the generating itemsets of the rules into a graph with itemsets as edges
itemsets <- generatingItemsets(rules)
hh <- associations2igraph(itemsets, associationsAsNodes = FALSE)
if(FALSE) plot(hh, layout = igraph::layout_in_circle)
#
# #save rules as a graph so they can be visualized using external tools
if(FALSE) saveAsGraph(rules, "rules.graphml")

Validation


17 Regression (B25, Dec-26)

17.1 Overview

  • “Supervised Learning Algorithm: Regression”

17.2 Packages

if(FALSE){# #WARNING: Installation may take some time.
  install.packages("fastDummies", dependencies = TRUE)
  install.packages("carData", dependencies = TRUE)
}

17.3 Linear Regression

36.4 The simplest type of regression analysis, involving one independent variable and one dependent variable in which the relationship between the variables is approximated by a straight line, is called simple linear regression.

  • \({Y}\): Scalar response, Dependent variable, Outcome variable, Target, Predicted

  • \({X}\): Explanatory variables, Independent variables, Antecedent variables, Predictors

  • It is applied when the objective is to predict the outcome variable based on the antecedent variables

  • Predicted (Y) should be continuous, but Predictors (X) can be either continuous or categorical

    • Ex: Salary (Y) is a function of Age, Gender, Education, Years of Experience
    • Ex: Consumption (Y) is a function of Income (X)
    • We are interested in \(Y = mX + C\) where \({m}\) is the slope of the line and \({C}\) is the y-intercept
      • Slope: the change in Y for a unit change in X
    • Because there will be some error \({\epsilon}\), the equation is given by \(y = {\alpha} + {\beta}x + {\epsilon}\) where \({\alpha}\) is the average Y when X is zero.
    • NOTE The equation can also be written as \({y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon}\). In that case assume \({\alpha} = {\beta}_0\) and \({\beta} = {\beta}_1\).
  • Question: \({\alpha}\) is constant for a given set of data

    • Yes
  • Refer Slope is tested for Significance

  • Simple Linear Regression is also known as Bivariate Regression. i.e. Single Y, Single X

  • When there are Single Y and Multiple X, it is called Multiple Linear Regression

    • Equation: \({y} = {\beta}_0 + {\beta}_1 {x}_1 + {\beta}_2 {x}_2 + \ldots + {\epsilon}\)
  • lm()

    • Base R Function to run linear model or regression
    • Tilde “~” means regressed on i.e. Dependent (Y) ~ Independents (X)
      • “linear regression model of y on x”
    • Models for lm are specified symbolically. A typical model has the form response ~ terms, where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response.
    • summary()
      • The standard error is the variability to expect in a coefficient; it captures sampling variability. In the perfect-fit example below, the standard errors of both the intercept and IncomeX are 0, so no variation beyond the estimates is expected.
      • t-value: the coefficient divided by its standard error.
        • It is basically how big the estimate is relative to its error.
        • The bigger the coefficient relative to the standard error, the bigger the t value; each t value comes with a p-value because it follows a distribution.
      • The p-value indicates how statistically significant the variable is to the model, for a confidence level of 95%.
        • If the p-value is less than alpha (0.05) for both the intercept and X, then both are statistically significant to our model.
      • Residual standard error, or the standard error of the model, is basically the average error for the model: the average amount by which our model can deviate while predicting Y.
        • The smaller the error, the better the model predicts.
      • Multiple R-squared is 1 - (sum of squared errors / total sum of squares).
      • Adjusted R-squared:
        • Adding variables increases R-squared whether or not they are significant for prediction. Adjusted R-squared is used instead because, if an added variable is not significant for the model's prediction, Adjusted R-squared will decrease; it is one of the most helpful tools to avoid overfitting the model.
    • The F-statistic is the ratio of the mean square of the model to the mean square of the error; in other words, the ratio of how well the model is doing to what the error is doing. The higher the F value, the better the model is doing compared to the error.
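As a quick sanity check (a minimal sketch, not part of the original notes), the coefficients that lm() reports for the Consumption/Income data below can be reproduced by hand from the least-squares formulas \(b_1 = \frac{\Sigma (x_i - \overline{x})(y_i - \overline{y})}{\Sigma (x_i - \overline{x})^2}\) and \(b_0 = \overline{y} - b_1 \overline{x}\):

```r
x <- seq(100, 500, by = 100)  # IncomeX
y <- seq(80, 160, by = 20)    # ConsumptionY
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(Intercept = b0, Slope = b1)  # 60.0 and 0.2, as lm() reports below
```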

lm()

bb <- tibble(ConsumptionY = seq.int(80, by = 20, length.out = 5),
             IncomeX = seq.int(100, by = 100, length.out = 5))
#
# #Build the model
mod_bb <- lm(formula = ConsumptionY ~ ., data = bb)
#
# #Model
suppressWarnings(mod_bb)
## 
## Call:
## lm(formula = ConsumptionY ~ ., data = bb)
## 
## Coefficients:
## (Intercept)      IncomeX  
##        60.0          0.2
#
# #Summarise the model
if(FALSE) suppressWarnings(summary(mod_bb))
#
# #ANOVA Table
if(FALSE) suppressWarnings(anova(mod_bb))

Model

names(mod_bb)
##  [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"       
##  [7] "qr"            "df.residual"   "xlevels"       "call"          "terms"         "model"
#
# #Coefficients (Model Parameters): Different headers than summary(mod_bb)$coefficients 
mod_bb$coefficients #coefficients(mod_bb)
## (Intercept)     IncomeX 
##        60.0         0.2
#
# #Residuals
mod_bb$df.residual
## [1] 3
#
f_pNum(mod_bb$residuals) #residuals(mod_bb) #summary(mod_bb)$residuals 
## 1 2 3 4 5 
## 0 0 0 0 0
#
# #What is Effects
f_pNum(mod_bb$effects) #effects(mod_bb)
## (Intercept)     IncomeX                                     
##     -268.33       63.25        0.00        0.00        0.00
#
# #Rank
mod_bb$rank
## [1] 2
#
# #Fitted Values
mod_bb$fitted.values #fitted.values(mod_bb)
##   1   2   3   4   5 
##  80 100 120 140 160
#
# #Assign
mod_bb$assign
## [1] 0 1
#
# #qr
mod_bb$qr[[1]] %>% as_tibble()
## # A tibble: 5 x 2
##   `(Intercept)`  IncomeX
##           <dbl>    <dbl>
## 1        -2.24  -671.   
## 2         0.447  316.   
## 3         0.447   -0.195
## 4         0.447   -0.512
## 5         0.447   -0.828
#
# #Others
if(FALSE) mod_bb$xlevels
if(FALSE) mod_bb$call #summary(mod_bb)$call
if(FALSE) mod_bb$terms
#
mod_bb$model
##   ConsumptionY IncomeX
## 1           80     100
## 2          100     200
## 3          120     300
## 4          140     400
## 5          160     500

Summary

#summary(mod_bb)
names(suppressWarnings(summary(mod_bb)))
##  [1] "call"          "terms"         "residuals"     "coefficients"  "aliased"       "sigma"        
##  [7] "df"            "r.squared"     "adj.r.squared" "fstatistic"    "cov.unscaled"
#
# #Coefficients: Different headers than mod_bb$coefficients
f_pNum(suppressWarnings(summary(mod_bb))$coefficients) %>% as_tibble()
## # A tibble: 2 x 4
##   Estimate `Std. Error` `t value` `Pr(>|t|)`
##      <dbl>        <dbl>     <dbl>      <dbl>
## 1     60              0   6.24e15          0
## 2      0.2            0   6.89e15          0
#
#
# #Better Printing
if(FALSE) f_pNum(summary(mod_bb)$coefficients) %>% as_tibble(rownames = "DummyParVsRef") %>% 
  rename(pVal = "Pr(>|t|)") %>% 
  mutate(pVal = ifelse(pVal < 0.001, 0, pVal), isSig = ifelse(pVal < 0.05, TRUE, FALSE))
#
# #R^2 and Adjusted R^2
suppressWarnings(summary(mod_bb))$r.squared
## [1] 1
suppressWarnings(summary(mod_bb))$adj.r.squared
## [1] 1
#
# #F-Statistic
suppressWarnings(summary(mod_bb))$fstatistic 
##                            value                            numdf                            dendf 
## 47536897508558620572600222024064                                1                                3
#
# #Covariance
suppressWarnings(summary(mod_bb))$cov.unscaled
##             (Intercept)  IncomeX
## (Intercept)       1.100 -0.00300
## IncomeX          -0.003  0.00001
#
# #Sigma
f_pNum(suppressWarnings(summary(mod_bb))$sigma)
## [1] 0

17.4 Prediction

  • Question: After generating the equation, can we calculate what the Consumption would be if the Income is 600
    • The model can be used to predict the dependent variable
    • (Aside) Caution: Prediction beyond min(x) and max(x) of the original dataset is extrapolation and is discouraged; i.e., Y for X = 150 can be predicted, but Y for X > 500 should be avoided.
# #Predict the Outcome variable using the Model
test_bb <- tibble(IncomeX = c(150, 600, 700))
#res_bb <- predict(mod_bb, test_bb)
res_bb <- test_bb %>% mutate(ConsumptionY = predict(mod_bb, .))
res_bb
## # A tibble: 3 x 2
##   IncomeX ConsumptionY
##     <dbl>        <dbl>
## 1     150           90
## 2     600          180
## 3     700          200
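These predictions are simply the fitted equation \(\hat{y} = 60 + 0.2x\) applied to the new Income values; a quick hand check (a sketch, using the coefficients printed earlier):

```r
b0 <- 60; b1 <- 0.2         # coefficients of mod_bb above
b0 + b1 * c(150, 600, 700)  # 90 180 200, matching predict()
```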

17.5 MS Excel: Regression Analysis

  • We need Data Analysis AddIn.
  • In Windows 10 & Microsoft Excel 2016
    • Menu | File | Options | Add-ins | Manage = Excel Add-ins | Go | Add-ins Popup
      • Tick the Analysis ToolPak | Go
      • As shown by the Professor, the sequence might be Menu | File | More | Options | …
    • Confirmation
      • Menu | Data |
        • Right Most section would have been added called “Analysis” and it will have one button “Data Analysis”
  • Regression Analysis
    • Enter the data
      • Menu | Data | Analysis | Data Analysis | Popup | Regression | OK
      • Select Input Y Range | Select Input X Range | Tick Labels | OK

Figure 17.1 (B25P01) Regression in MS Excel


Figure 17.2 (B25P02) Regression Result in MS Excel

17.6 Explanation of Terms

36.22 \(\text{\{Test for Significance in Simple Linear Regression\} } {H_0} : {\beta}_1 = 0 \iff {H_a}: {\beta}_1 \neq 0\)

  • t-value:
    • The slope model parameter \((\beta_1)\) needs to be tested for significance.
    • Refer Slope is tested for Significance
    • \(t = \frac{b_1}{s_{b_1}}\)
    • If \({}^2\!P_{(t)} \leq {\alpha} \to {H_0}\) Rejected.

36.21 Standard deviation of \(b_1\) is \({\sigma}_{b_1}\). Its estimate, estimated standard deviation of \(b_1\), is given by \(s_{b_1} = \frac{s}{\sqrt{\Sigma (x_i - {\overline{x}})^2}}\). The standard deviation of \(b_1\) is also referred to as the standard error of \(b_1\). Thus, \(s_{b_1}\) provides an estimate of the standard error of \(b_1\).
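A minimal check of 36.21 using R's built-in cars dataset (chosen here purely for illustration): compute \(s_{b_1}\) and the t value by hand and compare with what summary() reports.

```r
fit  <- lm(dist ~ speed, data = cars)
s    <- summary(fit)$sigma                                # residual standard error s
s_b1 <- s / sqrt(sum((cars$speed - mean(cars$speed))^2))  # estimated SE of b1
t_b1 <- coef(fit)[["speed"]] / s_b1                       # t = b1 / s_b1
# both agree with summary(fit)$coefficients["speed", ]
c(s_b1 = s_b1, t = t_b1)
```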

  • Question: In case of multiple variables, do we need to do this for all variables
    • No, we will not do it individually. We will do multiple regression. We will incorporate all the variables simultaneously.
    • In multiple regression, we get \(\{\beta_1, \beta_2, \beta_3, \ldots \}\) and each will have its own standard error i.e. \(\{s_{b_1}, s_{b_2}, s_{b_3}, \ldots \}\).
    • If a variable is NOT significant, it means that it is not contributing to the model in a meaningful manner.
  • F-Statistic:
    • For simple linear regression i.e. single X, single Y; the F-test and t-test provide same result. However, in multiple regression model F-test is used as the test for overall significance and t-tests are used as tests for individual significance.

37.5 \(\text{\{F-Test in Multiple Linear Regression\} } {H_0} : {\beta}_1 = {\beta}_2 = \cdots = {\beta}_p = 0 \iff {H_a}: \text{At least one parameter is not zero}\)

37.6 \(\text{\{t-Test in Multiple Linear Regression\} } {H_0} : {\beta}_i = 0 \iff {H_a}: {\beta}_i \neq 0\)

  • Question: Which one should come first Joint Significance or Individual Significance
    • In multiple regression analysis, the joint significance needs to be looked at first.
    • i.e., first the variables must be jointly able to predict the outcome variable; then we can check the contribution of individual variables
    • If the model is not good then there is no point in looking at individual variables
    • The p-value of the F-statistic should be less than 0.05 for us to consider the model valid.
  • Question: Can we drop the variables which are not contributing much to the model
    • We may find that, out of 4 independent variables A, B, C, D; C & D are not contributing much to the model. We can drop C & D. However, there is a possibility that A & B perform well because of the presence of C & D: though the contribution of C & D is by itself insignificant, it makes the contribution of A & B significant. C or D might be influencing A or B.
    • If there is a high number of variables, e.g. 20 or 30, then we drop the insignificant variables because model complexity becomes an issue.
      • How many independent variables we can practically handle in our business case is also a consideration
  • Question: Here we have single \(\beta_1\) then how the F-test is applied
    • We are only checking \(\beta_1 = 0\)
  • Any model is a good model if it has minimum number of predictors and maximum predictive power.

36.14 The ratio \(r^2 =\frac{\text{SSR}}{\text{SST}} \in [0, 1]\), is used to evaluate the goodness of fit for the estimated regression equation. This ratio is called the coefficient of determination (\(r^2\)). It can be interpreted as the proportion of the variability in the dependent variable y that is explained by the estimated regression equation.

  • \(r^2\) is the ‘Coefficient of Determination’ or ‘Goodness of Fit’
    • \(r^2 = 1\) means model is able to explain 100% relationship between independent and dependent variables.
    • \(r^2 = 0\) means model is not able to explain anything
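A small sketch verifying 36.14 on R's built-in cars dataset (illustrative only): compute SST, SSE, and SSR directly and compare the ratio with summary()$r.squared.

```r
fit <- lm(dist ~ speed, data = cars)
sst <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares (SST)
sse <- sum(residuals(fit)^2)                 # sum of squared errors (SSE)
r2  <- (sst - sse) / sst                     # SSR / SST
all.equal(r2, summary(fit)$r.squared)        # TRUE
```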

36.6 The random variable, error term \(({\epsilon})\), accounts for the variability in \({y}\) that cannot be explained by the linear relationship between \({x}\) and \({y}\).

  • Error Term \(\epsilon\) denotes unexplained variance

  • Question: What would be the acceptable value of \(r^2\)

    • No rule, context dependent
    • In general, we want \(r^2\) to be as high as possible
    • (Aside) Further, there is a ‘model overfitting’ concern also with very high \(r^2\)
  • Total = Explained (by independent variables) + Unexplained (\(\epsilon\))

    • Joint Significance is about all the independent variables
  • Question: Is there a possibility of situation where independent variables are not performing jointly but \(r^2\) is high

    • It is possible
    • There might be some internal issue which results in high \(r^2\) but low model performance
    • If the predictors are highly correlated (Multicollinearity)
      • Individual performance gets reduced, however, the \(r^2\) value might increase
      • Multicollinearity reduces the robustness of model
      • Individually, either variable can explain the dependent variable very well; however, due to multicollinearity, together they fail to perform at the same level.
      • Multicollinearity is not a problem between a dependent and an independent variable. It applies only between independent variables.
    • Regression analysis requires that the variables be non-overlapping (there should be no high correlation)

44.1 Multicollinearity is a condition where some of the predictor variables are strongly correlated with each other.

  • Question: How do we deal with Multicollinearity and other such problems
    • First of all, you need to select those independent variables which are not correlated to each other
  • Multiple R (in Excel)
    • The correlation between the observed and the predicted values of the dependent variable, i.e., the square root of R-squared
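The effect of multicollinearity on coefficient standard errors can be demonstrated with a small simulation (a sketch; the variables here are made up): adding a near-duplicate predictor inflates the standard error of the original coefficient even though the fit barely changes.

```r
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.05)  # x2 is almost a copy of x1 (multicollinearity)
y  <- 2 * x1 + rnorm(100)
se_alone <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
se_both  <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
c(alone = se_alone, with_collinear_x2 = se_both)  # SE inflates sharply
```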

37.4 If a variable is added to the model, \(R^2\) becomes larger even if the variable added is not statistically significant. The adjusted multiple coefficient of determination \((R_a^2)\) compensates for the number of independent variables in the model. With ‘n’ denoting the number of observations and ‘p’ denoting the number of independent variables: \(R_a^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}\)

  • Adjusted R square \(R_a^2\)
    • If we want to compare different models (and specially those with different number of independent variables), we should use Adjusted \(R_a^2\)
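The \(R_a^2\) formula in 37.4 can be verified directly (a sketch, using the built-in cars dataset with n = 50 observations and p = 1 predictor):

```r
fit <- lm(dist ~ speed, data = cars)
n  <- nrow(cars)                            # number of observations
p  <- 1                                     # number of independent variables
r2 <- summary(fit)$r.squared
ra <- 1 - (1 - r2) * (n - 1) / (n - p - 1)  # formula from 37.4
all.equal(ra, summary(fit)$adj.r.squared)   # TRUE
```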

17.7 Application of Regression

  • Two Applications of Regression
    • Estimation (Descriptive)
      • Which of these independent variables significantly affect the dependent variable
      • e.g. Which of these factors are influencing the employee performance
      • When we are doing estimation, data partition into train and test datasets is not required.
    • Prediction
      • Partition the Sample data randomly into Train and Test datasets in ratio of 80:20, 70:30 etc.
      • Do not predict beyond the min(x) and max(x) range because no data is available outside these limits.
      • Validation
        • Randomly Partition data | Build the Model on Train | Run it on Test | For Test we will have actual Y and predicted Y \((\hat{y})\) | Evaluate the difference between Actual and Predicted \({(y_i - \hat{y}_i)}\)
        • The error between Actual and Predicted is measured by a loss function.
        • Do this for all the models, compare them, and select the one with the lowest loss.
  • Question: More number of observations in Train dataset would lead to better model
    • Yes. That is why Train would have the major chunk. For small datasets it would be around 80%
  • Question: What is the drawback if Train contains 80% for large datasets
    • No issue.
    • Partition should be done randomly, otherwise no limitation.
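The Partition | Build | Run | Evaluate sequence above can be sketched on the built-in cars dataset (illustrative; RMSE is used as the loss function here):

```r
set.seed(1)
n     <- nrow(cars)
idx   <- sample(n, size = round(0.7 * n))  # random 70:30 partition
train <- cars[idx, ]
test  <- cars[-idx, ]
fit   <- lm(dist ~ speed, data = train)    # build on Train
pred  <- predict(fit, newdata = test)      # run on Test
rmse  <- sqrt(mean((test$dist - pred)^2))  # loss: actual vs predicted
rmse
```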

17.8 Bias-Variance Tradeoff

45.5 Even though the high-complexity model has low bias (error rate), it has a high variance; and even though the low-complexity model has a high bias, it has a low variance. This is known as the bias-variance trade-off. It is another way of describing the overfitting-underfitting dilemma.

44.3 Overfitting is the production of an analysis that corresponds too closely to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.
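A tiny simulation of the trade-off (a sketch with made-up data): a degree-1 fit (low complexity) versus a degree-9 polynomial (high complexity) on the same noisy sample. The complex model fits the training data much more closely, which is exactly where the overfitting risk comes from.

```r
set.seed(1)
x <- runif(40)
y <- sin(2 * pi * x) + rnorm(40, sd = 0.3)
fit_lo <- lm(y ~ x)           # low complexity: high bias, low variance
fit_hi <- lm(y ~ poly(x, 9))  # high complexity: low bias, high variance
c(rss_lo = sum(residuals(fit_lo)^2),
  rss_hi = sum(residuals(fit_hi)^2))  # training RSS of the complex fit is far lower
```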

17.9 Categorical Independent Variables

37.8 A categorical variable with \(k\) levels must be modeled using \(k - 1\) dummy variables (or indicator variables). It can take only the values 0 and 1. e.g. A variable with 3 levels of {low, medium, high} would need 2 dummy variables \(\{x_1, x_2\}\) each being either 0 or 1 only. i.e. low \(\to \{x_1 = 1, x_2 = 0\}\), medium \(\to \{x_1 = 0, x_2 = 1\}\), high \(\to \{x_1 = 0, x_2 = 0\}\). Thus \(x_1\) is 1 when low and 0 otherwise, \(x_2\) is 1 when medium and 0 otherwise. High is represented as neither \(x_1\) nor \(x_2\) i.e. both are zero. Note that both cannot be 1. Only one of them can be TRUE at a time.

  • When the independent variable is categorical e.g. Gender (M, F)
    • Then “one unit change in X” is not applicable
    • It is more about “change in state of X” from one level to another.
    • Convert Categorical Independent Variable into Dummy Variables.
    • The category that is not assigned an indicator variable is denoted the reference category (or the Benchmark).
      • If ‘M’ is assigned 0 then it will be the benchmark
      • If ‘F’ is assigned 0 then it will be the benchmark
      • Performance of all other levels of the variable is given as compared to the referenced level
    • In the example above ‘high’ is the benchmark
    • “dummy coding” leads to the creation of a table called a contrast matrix.
  • Question: What if we have a dataset with both categorical and continuous independent variables
    • Convert categorical to dummies; no need to do anything with continuous variables
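The low/medium/high coding in 37.8 can be inspected directly with contrasts(); with 'high' placed first so it becomes the reference level, R generates exactly the two dummies described above:

```r
lev <- factor(c("low", "medium", "high"), levels = c("high", "low", "medium"))
contrasts(lev)  # rows = levels, columns = the k - 1 dummy variables
##        low medium
## high     0      0
## low      1      0
## medium   0      1
```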

17.10 Example

ERROR 17.1 Warning messages: In model.response(mf, "numeric") : using type = "numeric" with a factor response will be ignored. In Ops.factor(y, ...) : ’-’ not meaningful for factors
  • Running summary() on this Model will result in Error
    • "Error in quantile.default(resid) : (unordered) factors are not allowed"
    • "In addition: Warning message: In Ops.factor(r, 2) : ’^’ not meaningful for factors"
  • Dependent Variable is Non-numeric. Found to be Factor in this case.
  • Y needs to be Numeric for Regression Analysis
# #Create Data Set with Factor Column having its First Level as the Reference for Later
bb <- tibble(Performance = c(35, 36, 40, 45, 60, 66, 67, 78, 80, 87, 78, 89, 89, 90),
             Class = factor(c(rep("A", 4), rep("B", 6), rep("C", 4)), levels = c("C", "B", "A")))
#
# #Create Dummies | Drop Original | Drop Reference Variable 
dum_bb <- dummy_cols(bb, select_columns = "Class", 
                     remove_selected_columns = TRUE, remove_first_dummy = TRUE)
str(dum_bb)
## tibble [14 x 3] (S3: tbl_df/tbl/data.frame)
##  $ Performance: num [1:14] 35 36 40 45 60 66 67 78 80 87 ...
##  $ Class_B    : int [1:14] 0 0 0 0 1 1 1 1 1 1 ...
##  $ Class_A    : int [1:14] 1 1 1 1 0 0 0 0 0 0 ...
#
mod_bb <- lm(Performance ~ ., data = bb)
if(FALSE) summary(mod_bb)$coefficients
# #Better Printing
if(TRUE) f_pNum(summary(mod_bb)$coefficients) %>% as_tibble(rownames = "DummyParVsRef") %>% 
  rename(pVal = "Pr(>|t|)") %>% 
  mutate(pVal = ifelse(pVal < 0.001, 0, pVal), isSig = ifelse(pVal < 0.05, TRUE, FALSE))
## # A tibble: 3 x 6
##   DummyParVsRef Estimate `Std. Error` `t value`  pVal isSig
##   <chr>            <dbl>        <dbl>     <dbl> <dbl> <lgl>
## 1 (Intercept)       86.5         3.94     22.0  0     TRUE 
## 2 ClassB           -13.5         5.09     -2.65 0.022 TRUE 
## 3 ClassA           -47.5         5.57     -8.53 0     TRUE
#
# #Anova Table 
if(FALSE) anova(mod_bb) 
if(TRUE) anova(mod_bb) %>% as_tibble(rownames = "Predictors") %>% 
  rename(pVal = "Pr(>F)") %>% 
  mutate(pVal = ifelse(pVal < 0.001, 0, pVal), isSig = ifelse(pVal < 0.05, TRUE, FALSE))
## # A tibble: 2 x 7
##   Predictors    Df `Sum Sq` `Mean Sq` `F value`  pVal isSig
##   <chr>      <int>    <dbl>     <dbl>     <dbl> <dbl> <lgl>
## 1 Class          2    4873.    2436.       39.2     0 TRUE 
## 2 Residuals     11     683       62.1      NA      NA NA

17.11 fastDummies

  • dummy_cols()
    • To generate dummy columns
set.seed(3) 
bb <- tibble(Performance = sample(1:100, size = 20),
             Grade = sample(LETTERS[1:3], size = 20, replace = TRUE))
str(bb, vec.len = 10)
## tibble [20 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Performance: int [1:20] 5 58 12 36 99 95 8 20 74 55 40 48 94 37 66 29 100 87 9 82
##  $ Grade      : chr [1:20] "B" "B" "A" "B" "C" "B" "C" "B" "C" "A" ...
#
# #Convert character to dummy columns 
if(FALSE) dum_bb <- dummy_cols(bb, select_columns = c("Grade"))
# #To keep only (k-1) columns to avoid multicollinearity
if(FALSE) dum_bb <- dummy_cols(bb, select_columns = c("Grade"),  
                              remove_first_dummy = TRUE, remove_selected_columns = TRUE)
if(TRUE) dum_bb <- dummy_cols(bb, select_columns = c("Grade"), 
                              remove_most_frequent_dummy = TRUE, remove_selected_columns = TRUE)
str(dum_bb, vec.len = 10)
## tibble [20 x 3] (S3: tbl_df/tbl/data.frame)
##  $ Performance: int [1:20] 5 58 12 36 99 95 8 20 74 55 40 48 94 37 66 29 100 87 9 82
##  $ Grade_A    : int [1:20] 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0
##  $ Grade_C    : int [1:20] 0 0 0 0 1 0 1 0 1 0 1 1 0 1 0 1 0 0 0 1
#
# #Multiple Linear Regression with Categorical to Dummy Variables
if(FALSE) mod_bb <- lm(formula = Performance ~ Grade_A + Grade_C, data = dum_bb)
if(TRUE) mod_bb <- lm(formula = Performance ~ ., data = dum_bb)
#
# #Model
mod_bb
## 
## Call:
## lm(formula = Performance ~ ., data = dum_bb)
## 
## Coefficients:
## (Intercept)      Grade_A      Grade_C  
##       61.87       -26.38        -9.75
#
# #Summarise the model
if(TRUE) summary(mod_bb)
## 
## Call:
## lm(formula = Performance ~ ., data = dum_bb)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56.88 -24.09  -4.00  30.03  46.88 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)    61.88      11.79   5.249 0.0000652 ***
## Grade_A       -26.38      20.42  -1.292     0.214    
## Grade_C        -9.75      16.67  -0.585     0.566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.34 on 17 degrees of freedom
## Multiple R-squared:  0.08959,    Adjusted R-squared:  -0.01751 
## F-statistic: 0.8365 on 2 and 17 DF,  p-value: 0.4503
#
# #ANOVA Table
if(TRUE) anova(mod_bb)
## Analysis of Variance Table
## 
## Response: Performance
##           Df  Sum Sq Mean Sq F value Pr(>F)
## Grade_A    1  1479.2 1479.20  1.3309 0.2646
## Grade_C    1   380.2  380.25  0.3421 0.5663
## Residuals 17 18894.8 1111.46

17.12 Data: CarDekho

  • Covered in Next Lecture.

17.13 Data: Salaries

Salaries

# #Load Data "Salaries". It is NOT included in "car". It is included in the "carData" Package
if(FALSE) data(package = "car")$results[ , "Item"]
data("Salaries", package = "carData")
str(Salaries)
## 'data.frame':    397 obs. of  6 variables:
##  $ rank         : Factor w/ 3 levels "AsstProf","AssocProf",..: 3 3 1 3 3 2 3 3 3 3 ...
##  $ discipline   : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2 ...
##  $ yrs.since.phd: int  19 20 4 45 40 6 30 45 21 18 ...
##  $ yrs.service  : int  18 16 3 39 41 6 23 45 20 18 ...
##  $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
##  $ salary       : int  139750 173200 79750 115000 141500 97000 175000 147765 119250 129000 ...

Categorical X with 2 levels

# #y = b0 + b1 * x : Y (Salaries), X (Sex) with 2 levels
#
# #In R "Factor" Notation (Default Alphabetical Ordering): Female = 1, Male = 2 
levels(Salaries$sex)
## [1] "Female" "Male"
#
# #Schemes for encoding categorical variables are known as 'contrast coding systems'. 
# #R directly converts the Categorical Variable into a dummy with Female = 0, Male = 1
# #The default in R is to use the first level of the factor as the reference 
# #and interpret the remaining levels relative to that level.
# #contrasts() lists the dummy variables that would be created for k levels i.e. k-1 dummies 
contrasts(Salaries$sex)
##        Male
## Female    0
## Male      1
#
# #Compute the model
mod_sal_f <- lm(salary ~ sex, data = Salaries)
if(TRUE) f_pNum(summary(mod_sal_f)$coefficients) %>% as_tibble(rownames = "DummyParVsRef") %>% 
  rename(pVal = "Pr(>|t|)") %>% 
  mutate(pVal = ifelse(pVal < 0.001, 0, pVal), isSig = ifelse(pVal < 0.05, TRUE, FALSE))
## # A tibble: 2 x 6
##   DummyParVsRef Estimate `Std. Error` `t value`   pVal isSig
##   <chr>            <dbl>        <dbl>     <dbl>  <dbl> <lgl>
## 1 (Intercept)    101002.        4809.     21.0  0      TRUE 
## 2 sexMale         14088.        5065.      2.78 0.0057 TRUE
#
# #Interpretation of Coefficients with 'Female' as the reference level i.e. 0 within the dummy 
# # b0 : average salary among Female (Reference): (Intercept, Estimate) 101002
# # b0 + b1 : average salary among Male : (101002) + (14088) = 115090
# # b1 : average difference in salary of Male & Female (Reference): (sexMale, Estimate) 14088
#
# #The p-value is 0 (significant), suggesting that there is statistical evidence 
# #of a difference in average salary between the genders
#
# #We can change the Factor Levels and thus change the Reference Variable
m_Salaries <- as_tibble(Salaries) %>% mutate(across(sex, factor, levels = c("Male", "Female")))
levels(m_Salaries$sex)
## [1] "Male"   "Female"
contrasts(m_Salaries$sex)
##        Female
## Male        0
## Female      1
#
mod_sal_m <- lm(salary ~ sex, data = m_Salaries)
if(TRUE) f_pNum(summary(mod_sal_m)$coefficients) %>% as_tibble(rownames = "DummyParVsRef") %>% 
  rename(pVal = "Pr(>|t|)") %>% 
  mutate(pVal = ifelse(pVal < 0.001, 0, pVal), isSig = ifelse(pVal < 0.05, TRUE, FALSE))
## # A tibble: 2 x 6
##   DummyParVsRef Estimate `Std. Error` `t value`   pVal isSig
##   <chr>            <dbl>        <dbl>     <dbl>  <dbl> <lgl>
## 1 (Intercept)    115090.        1587.     72.5  0      TRUE 
## 2 sexFemale      -14088.        5065.     -2.78 0.0057 TRUE
#
# #Interpretation of Coefficients with 'Male' as the reference level i.e. 0 within the dummy 
# # b0 : average salary among Male (Reference): (Intercept, Estimate) 115090
# # b0 + b1 : average salary among Female : (115090) + (-14088) = 101002
# # b1 : average difference in salary of Female & Male (Reference): (sexFemale, Estimate) -14088
#
# #The fact that the coefficient for sexFemale in the regression output is negative 
# #indicates that being a Female is associated with decrease in salary (relative to Male).
#
# #The p-value is 0 (significant)
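The coefficient interpretation above can be cross-checked against the raw group means; a minimal sketch (only the carData package is assumed):

```r
# #Under dummy coding, b0 and b0 + b1 are just the group means
data("Salaries", package = "carData")
grp <- tapply(Salaries$salary, Salaries$sex, mean)
round(grp)                              #Female ~101002, Male ~115090
round(grp[["Male"]] - grp[["Female"]])  #Matches the sexMale estimate ~14088
```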

Categorical X with 3 levels

# #y = b0 + b1 * x : Y (Salaries), X (Rank) with 3 levels
#
# #Change Factor Levels | Treating Professor as Reference Variable 
# #Ordering is done in decreasing Rank but NOT as 'Ordered Factor' for now
r_Salaries <- as_tibble(Salaries) %>% 
  mutate(across(rank, factor, levels = c("Prof", "AssocProf", "AsstProf")))
levels(r_Salaries$rank)
## [1] "Prof"      "AssocProf" "AsstProf"
#
# #contrasts() lists the dummy variables that would be created for k levels i.e. k-1 dummy 
# #Two dummies were created against the reference of "Prof"
contrasts(r_Salaries$rank)
##           AssocProf AsstProf
## Prof              0        0
## AssocProf         1        0
## AsstProf          0        1
#
mod_sal_r <- lm(salary ~ rank, data = r_Salaries)
if(TRUE) f_pNum(summary(mod_sal_r)$coefficients) %>% as_tibble(rownames = "DummyParVsRef") %>% 
  rename(pVal = "Pr(>|t|)") %>% 
  mutate(pVal = ifelse(pVal < 0.001, 0, pVal), isSig = ifelse(pVal < 0.05, TRUE, FALSE))
## # A tibble: 3 x 6
##   DummyParVsRef Estimate `Std. Error` `t value`  pVal isSig
##   <chr>            <dbl>        <dbl>     <dbl> <dbl> <lgl>
## 1 (Intercept)    126772.        1449.      87.5     0 TRUE 
## 2 rankAssocProf  -32896.        3290.     -10.0     0 TRUE 
## 3 rankAsstProf   -45996.        3231.     -14.2     0 TRUE
#
# #Interpretation of Coefficients with 'Prof' as the reference level i.e. {0 0} within the dummies
# # b0 : average salary among Prof (Reference): (Intercept) 126772
# # b0 + b1 : average salary among AssocProf : (126772) + (-32896) 
# # b0 + b2 : average salary among AsstProf  : (126772) + (-45996) 
# # b1 : average difference in salary of AssocProf & Prof (Reference): (rankAssocProf) -32896
# # b2 : average difference in salary of AsstProf & Prof (Reference): (rankAsstProf) -45996
#
# #The p-value is 0 (significant), suggesting that there is statistical evidence 
# #of a difference in average salary between the ranks

# #The fact that the coefficient for rankAssocProf & rankAsstProf in the regression output are 
# #negative indicates that lower ranks are associated with lower salary (relative to Prof).
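The dummy expansion R performs internally can be inspected with model.matrix(); a minimal sketch assuming r_Salaries from the chunk above:

```r
# #model.matrix() shows the k-1 dummy columns actually used in the fit
head(model.matrix(~ rank, data = r_Salaries))
# #Columns: (Intercept), rankAssocProf, rankAsstProf;
# #a "Prof" row is 0 in both dummy columns i.e. the reference {0 0}
```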

Complete Model

# #y = b0 + b1*x1 + ... + bp*xp : Y (salary), X (rank, discipline, yrs.since.phd, yrs.service, sex)
#
# #Change Factor Levels | Reference: Rank = Professor, Sex = Male, Discipline = A
bb <- as_tibble(Salaries) %>% 
  mutate(across(sex, factor, levels = c("Male", "Female"))) %>% 
  mutate(across(rank, factor, levels = c("Prof", "AssocProf", "AsstProf")))
#
# #contrasts() lists the dummy variables that would be created for k levels i.e. k-1 dummy 
contrasts(bb$rank)
##           AssocProf AsstProf
## Prof              0        0
## AssocProf         1        0
## AsstProf          0        1
contrasts(bb$sex)
##        Female
## Male        0
## Female      1
contrasts(bb$discipline)
##   B
## A 0
## B 1
#
mod_bb <- lm(salary ~ ., data = bb)
if(TRUE) f_pNum(summary(mod_bb)$coefficients) %>% as_tibble(rownames = "DummyParVsRef") %>% 
  rename(pVal = "Pr(>|t|)") %>% 
  mutate(pVal = ifelse(pVal < 0.001, 0, pVal), isSig = ifelse(pVal < 0.05, TRUE, FALSE))
#
# #Anova Table of Base R
if(TRUE) anova(mod_bb) %>% as_tibble(rownames = "Predictors") %>% 
  rename(pVal = "Pr(>F)") %>% 
  mutate(pVal = ifelse(pVal < 0.001, 0, pVal), isSig = ifelse(pVal < 0.05, TRUE, FALSE))
## # A tibble: 6 x 7
##   Predictors       Df      `Sum Sq`    `Mean Sq` `F value`    pVal isSig
##   <chr>         <int>         <dbl>        <dbl>     <dbl>   <dbl> <lgl>
## 1 rank              2 143231765736. 71615882868.   141.     0      TRUE 
## 2 discipline        1  18429929986. 18429929986.    36.3    0      TRUE 
## 3 yrs.since.phd     1    165649329.   165649329.     0.326  0.568  FALSE
## 4 yrs.service       1   2576287631.  2576287631.     5.07   0.0249 TRUE 
## 5 sex               1    780676354.   780676354.     1.54   0.216  FALSE
## 6 Residuals       390 198116333525.   507990599.    NA     NA      NA
#
# #Anova Table of Car Package which automatically takes care of unbalanced designs
if(TRUE) Anova(mod_bb) %>% as_tibble(rownames = "Predictors") %>% 
  rename(pVal = "Pr(>F)") %>% 
  mutate(pVal = ifelse(pVal < 0.001, 0, pVal), isSig = ifelse(pVal < 0.05, TRUE, FALSE))
## # A tibble: 6 x 6
##   Predictors         `Sum Sq`    Df `F value`    pVal isSig
##   <chr>                 <dbl> <dbl>     <dbl>   <dbl> <lgl>
## 1 rank           69507674502.     2     68.4   0      TRUE 
## 2 discipline     19237331473.     1     37.9   0      TRUE 
## 3 yrs.since.phd   2504060819.     1      4.93  0.0270 TRUE 
## 4 yrs.service     2710023485.     1      5.33  0.0214 TRUE 
## 5 sex              780676354.     1      1.54  0.216  FALSE
## 6 Residuals     198116333525.   390     NA    NA      NA
#
# #Taking other variables into account, it can be seen that the categorical variable sex 
# #is no longer significantly associated with the variation in salary between individuals. 
# #Significant variables are rank and discipline.

anova() vs. Anova()

# #Anova Table of Base R (Type I : each variable is added in sequential order)
anova(mod_bb) %>% as_tibble(rownames = "Predictors")
## # A tibble: 6 x 6
##   Predictors       Df      `Sum Sq`    `Mean Sq` `F value`  `Pr(>F)`
##   <chr>         <int>         <dbl>        <dbl>     <dbl>     <dbl>
## 1 rank              2 143231765736. 71615882868.   141.     8.43e-47
## 2 discipline        1  18429929986. 18429929986.    36.3    3.95e- 9
## 3 yrs.since.phd     1    165649329.   165649329.     0.326  5.68e- 1
## 4 yrs.service       1   2576287631.  2576287631.     5.07   2.49e- 2
## 5 sex               1    780676354.   780676354.     1.54   2.16e- 1
## 6 Residuals       390 198116333525.   507990599.    NA     NA
#
# #Anova Table of Car Package which automatically takes care of unbalanced designs (Type II)
# #Type II tests each variable after all the others
# #There is a Type III also. However, its usage is highly controversial 
Anova(mod_bb, type = 2) %>% as_tibble(rownames = "Predictors")
## # A tibble: 6 x 5
##   Predictors         `Sum Sq`    Df `F value`  `Pr(>F)`
##   <chr>                 <dbl> <dbl>     <dbl>     <dbl>
## 1 rank           69507674502.     2     68.4   3.40e-26
## 2 discipline     19237331473.     1     37.9   1.88e- 9
## 3 yrs.since.phd   2504060819.     1      4.93  2.70e- 2
## 4 yrs.service     2710023485.     1      5.33  2.14e- 2
## 5 sex              780676354.     1      1.54  2.16e- 1
## 6 Residuals     198116333525.   390     NA    NA
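The order dependence of Type I sums of squares can be seen by refitting with the predictors permuted; a sketch assuming bb from the Complete Model chunk:

```r
# #Type I (anova): terms are tested sequentially, so predictor order matters
mod_o1 <- lm(salary ~ rank + discipline + yrs.since.phd + yrs.service + sex, data = bb)
mod_o2 <- lm(salary ~ sex + yrs.service + yrs.since.phd + discipline + rank, data = bb)
anova(mod_o1)  #sex entered last
anova(mod_o2)  #sex entered first: its Sum Sq and F change
# #Type II (car::Anova) tests each term after all the others, so order does not matter
Anova(mod_o1, type = 2)
Anova(mod_o2, type = 2)  #Same table (rows reordered)
```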

Validation


18 Linear Regression (B26, Jan-02)

18.1 Overview

  • “Machine learning using linear regression”

18.2 Packages

if(FALSE){# #WARNING: Installation may take some time.
  install.packages("car", dependencies = TRUE)
}

18.3 Data: Hospital

Please import the "B26-Hospital.xlsx"

18.4 Data: KC House

Please import the "B26-KC-House.csv"

18.5 Data: CarDekho

Please import the "B26-CarDekho.csv"

EDA

aa <- xxB26CarDekho
# #Split String | Rename | Filter | Factor | Age | Drop |
# #To prevent long dummy names, each variable name and its levels have been shortened
# #Each Factor's levels have been reordered in decreasing order of occurrence
# #e.g. Diesel is the most frequent fuel and has thus been made the Reference Dummy
xsyw <- aa %>% 
  separate(name, c("brand", NA), sep = " ", remove = FALSE, extra = "drop") %>% 
  filter(fuel != "Electric") %>% 
  #mutate(across(where(is.character), factor)) %>% 
  mutate(across(fuel, factor, levels = c("Diesel", "Petrol", "CNG", "LPG"))) %>% 
  mutate(across(transmission, factor, levels = c("Manual", "Automatic"), 
                labels = c("Manual", "Auto"))) %>% 
  mutate(across(owner, factor, 
levels = c("First Owner", "Second Owner", "Third Owner", "Fourth & Above Owner", "Test Drive Car"), 
labels = c("I", "II", "III", "More", "Test"))) %>% 
  mutate(across(seller_type, factor, levels = c("Individual", "Dealer", "Trustmark Dealer"), 
                labels = c("Indiv", "Dealer", "mDealer"))) %>% 
  rename(price = selling_price, km = km_driven, 
         s = seller_type, o = owner, t = transmission, f = fuel) %>% 
  mutate(age = 2022 - year) %>% 
  select(-c(year, name, brand))
# 
xw <- xsyw %>% select(-price)

18.6 Model

  • Question: There are 29 Brands. Should we do the analysis separately for them e.g. When we are creating a Model for Price of ‘Maruti,’ should we not remove ‘BMW’
    • Yes, if we have a large set of data, it would be good to separate each car /brand /model
  • Question: What is the Top Brand
    • Maruti with 1280 out of 4340 cars
  • Question: Can we drop “Petrol” in place of “CNG” i.e. convert it to the reference
    • Options available are for ‘First’ or ‘Most Frequent’
    • (Aside) We would need to convert to a factor with ‘Petrol’ as the First Level; then it can be treated as the Reference.

44.3 Overfitting is the production of an analysis that corresponds too closely to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.

44.4 Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data.

  • Question: Does “1” in the sample() syntax mean 1 sample, and if we want two sets would we modify sample(1:nrow(bb), size = 0.8 * nrow(bb)) to sample(2:nrow(bb), size = 0.8 * nrow(bb))
    • No
    • (Aside) No
      • sample(1:10, size = 3) means pick 3 items out of the set of 10 items which are in [1, 10]
      • sample(11:20, size = 3) means pick 3 items out of the set of 10 items which are in [11, 20]
      • sample(2:10, size = 3) means pick 3 items out of the set of 9 items which are in [2, 10]
      • We are using the selected indices to partition the data in two sets selected (i.e. Train) and not selected (i.e. Test)
      • To create 3 or more sets Refer Partition data in Train & Test
    • While sample() can take a dataframe (it samples columns, and would not throw an error as long as size is at most the number of columns), the outcome is not meaningful and would not match the expectation.
    • The probability vector given to sample() is not a partitioning range. It is the probability of each index being chosen.
  • Form of the Function
    • \(y \sim x :\) “linear regression model of y on x”
      • Tilde “~” means regressed on i.e. Dependent (Y) ~ Independents (X)
    • \(y \sim x_1 + x_2 :\) “linear regression model of y on \(\{x_1, x_2\}\)
    • \(y \sim . :\) “linear regression model of y on \(\{x_1, x_2, \ldots, x_p\}\)
      • Dot (.) means all variables in the dataset except the dependent variable
  • Question: How to check count of values for each level i.e. each dummy variable
    • (Aside) See Below
  • Question: Is it better to take care of this type of problem in the original dataset
    • Yes
    • There is no point in building a model with 1 “Electric” Car.
  • Question: Is it possible to generalise this process of checking the variable levels and, if some have a single occurrence, eliminate them
    • We need to do it by ourselves on a case to case basis
  • Question: If the single data point is highly relevant, then how can we ensure that it is in the training set and not in the test set
    • Ex: if the dataset has gender and it is 100% populated by “Male,” do we need to include that as variable
      • No
    • (Argument) But here we have at least 1 observation, and not including it will lead to the model failing while predicting “Electric”
      • It needs to have some minimum number of observations for model to be based on that
  • Question: What happens if we keep this observation
    • Model may give us wrong fit
    • We should not keep a predictor (here, a dummy variable) which actually has no relevance to our model
  • Question: What happens if we decide to create a train dataset where we include at least a minimum number of observations (e.g. 2) for each level i.e. for each dummy variable. Would the Model be considered biased in that case because we have not created the training set randomly
    • Generally, that kind of splitting we will do when we are doing classification problem (Categorical Y)
    • It is "Stratified Random Sampling". There we ensure that we have an equal proportion of each group
    • For Independent variable of categorical nature generally we do not do stratified sampling
    • (Further) But can we do this stratified sampling here so that ‘NA’ does not come up in the model outcome because of few or none observations of a level
      • If you have a categorical predictor dominated by a single level (e.g. 99% “Male”), then it is better not to consider that as a predictor
      • In fact, the rare level might look like an ‘Outlier’
  • Question: Can we extrapolate this logic and say there are only 23 LPG so drop those also
    • We should not drop this one because it has at least some observations
  • Question: So is there a rule of thumb for how many minimum observations should be present
    • Judgement i.e. Case by case basis
    • We can look at the significance of the predictor and then we can decide
      • "Stepwise Regression" can remove those predictors which are insignificant
  • Question: What happens if all the 23 observations of ‘LPG’ are grouped into one of the datasets i.e. either train or test because it is a random selection
    • This is not the final model. There will be multiple iterations. Subsequently, it will be included if it is a good predictor or would be dropped if it is insignificant
  • Refer Application of Regression
    • Estimation (Descriptive)
      • Which of these independent variables significantly affect the dependent variable
      • e.g. Which of these factors are influencing the employee performance
      • When we are doing estimation, data partition into train and test datasets is not required.
    • Prediction
      • Partition the Sample data randomly into Train and Test datasets in ratio of 80:20, 70:30 etc.
  • Refer How to Disable Scientific Notation in R
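The scientific-notation setting referred to above is controlled by the scipen option; a minimal base-R sketch:

```r
# #scipen is a penalty against scientific notation in printing; 
# #a large positive value forces fixed notation
options(scipen = 999)
1e8    ## [1] 100000000
options(scipen = 0)  #Restore the default
1e8    ## [1] 1e+08
```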

Build

# #Removed Fuel "Electric"
unique(xsyw$f)
## [1] Petrol Diesel CNG    LPG   
## Levels: Diesel Petrol CNG LPG
str(xsyw)
## tibble [4,339 x 7] (S3: tbl_df/tbl/data.frame)
##  $ price: num [1:4339] 60000 135000 600000 250000 450000 140000 550000 240000 850000 365000 ...
##  $ km   : num [1:4339] 70000 50000 100000 46000 141000 125000 25000 60000 25000 78000 ...
##  $ f    : Factor w/ 4 levels "Diesel","Petrol",..: 2 2 1 2 1 2 2 2 2 3 ...
##  $ s    : Factor w/ 3 levels "Indiv","Dealer",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ t    : Factor w/ 2 levels "Manual","Auto": 1 1 1 1 1 1 1 1 1 1 ...
##  $ o    : Factor w/ 5 levels "I","II","III",..: 1 1 1 1 2 1 1 2 1 1 ...
##  $ age  : num [1:4339] 15 15 10 5 8 15 6 8 7 5 ...
#
# #Dummy | Drop First Level i.e. Reference | Drop Selected Columns i.e. Original |
dum_xsyw <- xsyw %>% dummy_cols(.data = ., 
                  select_columns = c("f", "s", "t", "o"), 
                  remove_first_dummy = TRUE, remove_selected_columns = TRUE)
names(dum_xsyw)
##  [1] "price"     "km"        "age"       "f_Petrol"  "f_CNG"     "f_LPG"     "s_Dealer"  "s_mDealer"
##  [9] "t_Auto"    "o_II"      "o_III"     "o_More"    "o_Test"
#
# #Partition Data
set.seed(3)
#idx_xsyw <- sample(x = 1:nrow(dum_xsyw), size = 0.8 * nrow(dum_xsyw)) #Equivalent
idx_xsyw <- sample.int(n = nrow(dum_xsyw), size = floor(0.8 * nrow(dum_xsyw)), replace = FALSE)
train_xsyw <- dum_xsyw[idx_xsyw, ]
test_xsyw  <- dum_xsyw[-idx_xsyw, ]
#
mod_xsyw <- lm(price ~ ., data = train_xsyw)
if(TRUE) f_pNum(summary(mod_xsyw)$coefficients) %>% as_tibble(rownames = "DummyParVsRef") %>% 
  rename(pVal = "Pr(>|t|)") %>% 
  mutate(pVal = ifelse(pVal < 0.001, 0, pVal), isSig = ifelse(pVal < 0.05, TRUE, FALSE))
## # A tibble: 13 x 6
##    DummyParVsRef   Estimate `Std. Error` `t value`   pVal isSig
##    <chr>              <dbl>        <dbl>     <dbl>  <dbl> <lgl>
##  1 (Intercept)    925260.       20095.       46.0  0      TRUE 
##  2 km                 -0.79         0.18     -4.43 0      TRUE 
##  3 age            -35084.        2023.      -17.3  0      TRUE 
##  4 f_Petrol      -282851.       15047.      -18.8  0      TRUE 
##  5 f_CNG         -292937.       72253.       -4.05 0      TRUE 
##  6 f_LPG         -239453.       89028.       -2.69 0.0072 TRUE 
##  7 s_Dealer        38973.       17383.        2.24 0.025  TRUE 
##  8 s_mDealer      223254.       46730.        4.78 0      TRUE 
##  9 t_Auto         830131.       23513.       35.3  0      TRUE 
## 10 o_II           -44199.       17623.       -2.51 0.012  TRUE 
## 11 o_III          -47941.       29600.       -1.62 0.11   FALSE
## 12 o_More         -24595.       54846.       -0.45 0.65   FALSE
## 13 o_Test         176423.      113694.        1.55 0.12   FALSE
#
# #Anova Table 
if(FALSE) anova(mod_xsyw) %>% as_tibble(rownames = "Predictors") %>% 
  rename(pVal = "Pr(>F)") %>% 
  mutate(pVal = ifelse(pVal < 0.001, 0, pVal), isSig = ifelse(pVal < 0.05, TRUE, FALSE))
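Since the data were partitioned for prediction, the held-out 20% can be used for a quick accuracy check; a sketch assuming mod_xsyw and test_xsyw from the chunk above (RMSE as an illustrative metric):

```r
# #Predict on the unseen test set and compute RMSE (in price units)
pred_xsyw <- predict(mod_xsyw, newdata = test_xsyw)
sqrt(mean((test_xsyw$price - pred_xsyw)^2))
```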

Top Brands

# #What are the Top Brands (29 levels)
aa %>% 
  separate(name, c("brand", NA), sep = " ", remove = FALSE, extra = "drop") %>% 
  count(brand) %>% arrange(desc(n))
## # A tibble: 29 x 2
##    brand          n
##    <chr>      <int>
##  1 Maruti      1280
##  2 Hyundai      821
##  3 Mahindra     365
##  4 Tata         361
##  5 Honda        252
##  6 Ford         238
##  7 Toyota       206
##  8 Chevrolet    188
##  9 Renault      146
## 10 Volkswagen   107
## # ... with 19 more rows

Change Reference Level

# #To Make a different level the reference i.e. CNG in place of Diesel
ii <- xsyw
levels(ii$f) 
## [1] "Diesel" "Petrol" "CNG"    "LPG"
jj <- ii %>% mutate(f = relevel(f, ref = "CNG"))
levels(jj$f)
## [1] "CNG"    "Diesel" "Petrol" "LPG"
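To make the most frequent level the reference automatically (instead of listing levels by hand), forcats (already attached per the session info) provides fct_infreq(); a sketch assuming aa from the EDA chunk:

```r
# #fct_infreq() orders levels by decreasing frequency, so the most 
# #frequent level (Diesel here) becomes the reference automatically
kk <- aa %>% filter(fuel != "Electric") %>% 
  mutate(fuel = forcats::fct_infreq(fuel))
levels(kk$fuel)  #Expected: "Diesel" "Petrol" "CNG" "LPG"
```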

sample()

set.seed(3)
# #Create a character set of 10 items. 
# #The initial 10 letters were not chosen, to show the difference between 
# #indexing numbers /positions and actual item values.
ii <- letters[11:20]
ii
##  [1] "k" "l" "m" "n" "o" "p" "q" "r" "s" "t"
# #Note: length() is used on vectors & nrow() is used on dataframes
# #Pick 3 letters out of these 10 items and indexing numbers can be within [1, 10]
idx <- sample(1:length(ii), size = 3)
ii[idx] #If ii is dataframe then a comma would be required i.e. ii[idx, ]
## [1] "o" "q" "n"
#
# #We can directly pick up 3 items if it is a Vector
sample(ii, size = 3) #Not applicable for dataframes etc. and thus should be avoided.
## [1] "t" "r" "m"
#
# #Going back to using length() or nrow()
#
# #Pick 3 letters out of these 10 items and indexing numbers can be within [2, 10]
# #i.e. First Index letter "k" can never be chosen
idx <- sample(2:length(ii), size = 3)
ii[idx]
## [1] "s" "o" "m"
#
# #Pick 3 letters out of these 10 items and indexing numbers can be within [6, 10]
# #First 5 letters can never be chosen i.e. "k" "l" "m" "n" "o" excluded
# #Further, Some remaining letters have different probabilities /weightage
# #Note: the Probabilities need not sum to 1; they are rescaled internally
# #Length of Probability Vector needs to match the Range Length of Indexing i.e. 6-10 has 5 items
idx <- sample(6:length(ii), size = 3, prob = c(0.2, 0.3, 0.1, 0.1, 0.1))
ii[idx]
## [1] "p" "r" "s"
#
# #Note, using this approach we always get the indices of chosen items, resulting in two sets
# #To get 3 sets, the syntax is different and covered elsewhere.
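One way to get three sets in a single call is to sample set labels instead of indices; a self-contained sketch (the 60/20/20 proportions are illustrative):

```r
set.seed(3)
ii <- letters[11:20]
# #Assign each item a set label, then split on the label
grp <- sample(c("train", "valid", "test"), size = length(ii), 
              replace = TRUE, prob = c(0.6, 0.2, 0.2))
split(ii, grp)  #A named list with up to three character vectors
```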

Count of Each Level

# #For any Given Column, What is the Count for each Level
xw %>% count(f) %>% arrange(desc(n))
## # A tibble: 4 x 2
##   f          n
##   <fct>  <int>
## 1 Diesel  2153
## 2 Petrol  2123
## 3 CNG       40
## 4 LPG       23
#
# #What is the Count for each Dummy Variable
dum_xsyw %>% summarise(across(4:ncol(.), sum)) %>% pivot_longer(everything())
## # A tibble: 10 x 2
##    name      value
##    <chr>     <int>
##  1 f_Petrol   2123
##  2 f_CNG        40
##  3 f_LPG        23
##  4 s_Dealer    993
##  5 s_mDealer   102
##  6 t_Auto      447
##  7 o_II       1105
##  8 o_III       304
##  9 o_More       81
## 10 o_Test       17

18.7 Explanation of Estimates

  • With Referenced Variables and Keeping others constant
  • Significant and inversely affecting the Selling Price
    • km_driven
    • age
  • Significant and positively affecting the Selling Price
    • Diesel has higher selling price compared to CNG
    • Other Fuels do not have significant impact (when CNG is the Reference)
  • Question: Are the km_driven and age not correlated. Would these together not cause the multicollinearity issue
    • We can check by ‘cor(),’ however it is not very high
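The correlation check mentioned above is a one-liner; a sketch assuming xsyw from the EDA chunk:

```r
# #Pairwise correlation of the continuous predictors km and age
cor(xsyw$km, xsyw$age)
# #Per the discussion, positive but not very high; vif() (Section 18.9) 
# #gives the formal multicollinearity check
```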

18.8 Correlation Plots

  • (Aside) Scaling of Dummy Variables would not negatively impact the analysis. However, scaling is done to adjust for large variances. As all the dummy variables are {0, 1}, these have not been scaled.
  • (Aside) The referenced dummy variable (e.g. Diesel) is NOT included
    • It would be highly (inversely) correlated with the 2nd-most frequent level (Petrol), because whenever one occurs the other cannot
# #Exclude Y | Scale Continuous NOT Dummies |
zw <- dum_xsyw %>% select(-price) %>% 
  mutate(across(c(km, age), ~ as.vector(scale(.))))
# #Long
#f_wl(zw)

Images

Figure 18.1 (B26P01) CarDekho: GGplot: Corrplot of Dummies (Scaled)

Figure 18.2 (B26P02 B26P03) CarDekho: corrplot vs. psych

Code GGplot

# #IN: zw
cap_hh <- "B26P01"
ttl_hh <- "CarDekho: GGplot: Corrplot of Dummies (Scaled)"
sub_hh <- "showing only the correlation not significance"
lgd_hh <- "Correlation"
#
# #Correlation Matrix pXp | Tibble pX(p+1) | Long | Unique Factor | Remove duplicates AB = BA |
# #factor() with levels = unique(name) keeps levels in order of occurrence; the default is alphabetical
hh <- cor(zw) %>% 
  as_tibble(rownames = "dummies") %>% 
  pivot_longer(cols = -dummies) %>% 
  mutate(across(where(is.character), factor, levels = unique(name))) %>% 
  filter(!duplicated(paste0(pmax(as.character(dummies), as.character(name)), 
                            pmin(as.character(dummies), as.character(name)))))
# #IN: hh[dummies, names, value] (Correlation Tibble Long, Triangle with Diagonal) 
B26 <- hh %>% { ggplot(., aes(x = dummies, y = name, fill = value)) + 
    geom_tile(color = "white") + 
    geom_text(aes(label = round(value, 2)), color = "black", size = 4) +
    coord_fixed() +
    scale_fill_distiller(palette = "BrBG", direction = 1, limits = c(-1, 1)) +
    #scale_x_discrete(position = "top") +
    scale_y_discrete(limits = rev) +
    guides(fill = guide_colourbar(barwidth = 0.5, barheight = 15)) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1),
          axis.title = element_blank(), 
          axis.line = element_blank(), 
          axis.ticks = element_blank(),
          panel.grid.major = element_blank(), 
          panel.border = element_blank()) +
      labs(fill = lgd_hh, subtitle = sub_hh, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B26)
rm(B26)

Code Corrplot

if(FALSE){#Testing
  ii_dum_xsyw <- xsyw %>% dummy_cols(.data = ., 
                  select_columns = c("f", "s", "t", "o"), 
                  remove_first_dummy = FALSE, remove_selected_columns = TRUE)
  ii <- ii_dum_xsyw %>% select(-price) %>% 
    mutate(across(c(km, age), ~ as.vector(scale(.))))
  hh <- cor(ii)
  corr_hh <- corrplot::cor.mtest(ii)
}
#
hh <- cor(zw)
corr_hh <- corrplot::cor.mtest(zw)
#
cap_hh <- "B26P02"
ttl_hh <- "CarDekho: corrplot: Corrplot of Dummies (Scaled)"
loc_png <- paste0(.z$PX, "B26P02", "-CarDekho-corrplot-Corrplot-z", ".png")
#
if(!file.exists(loc_png)) {
  png(filename = loc_png) 
  corrplot::corrplot(hh, method = "circle", type = "lower", diag = FALSE, 
                   cl.pos = 'n', tl.pos = 'ld', addCoef.col = "black", 
                   p.mat = corr_hh$p, sig.level = 0.05, insig = 'blank', 
        #order = "hclust", hclust.method = "ward.D", addrect = 2, rect.col = 3, rect.lwd = 3, 
                   title = NULL #, col = RColorBrewer::brewer.pal(3, "BrBG")
                   )
  title(main = ttl_hh, line = 2, adj = 0)
  title(sub = cap_hh, line = 4, adj = 1)
  B26 <- recordPlot()
  dev.off()
  assign(cap_hh, B26)
  rm(B26)
}

Code psych

hh <- psych::corr.test(zw)
#
cap_hh <- "B26P03"
ttl_hh <- "CarDekho: psych: Corrplot of Dummies (Scaled)"
loc_png <- paste0(.z$PX, "B26P03", "-CarDekho-Psych-Corrplot-z", ".png")
#
#
if(!file.exists(loc_png)) {
  png(filename = loc_png) 
  psych::corPlot(hh$r, pval = hh$p, upper = FALSE, diag = FALSE, show.legend = TRUE, 
                 xlas = 2, cex = 0.6,
                 #keep.par = FALSE, 
    gr = colorRampPalette(RColorBrewer::brewer.pal(3, "BrBG")), main = ttl_hh)
  title(sub = cap_hh, line = 4, adj = 1)
  B26 <- recordPlot()
  dev.off()
  assign(cap_hh, B26)
  rm(B26)
}

Why NOT SPLOM

  • (Aside)
    • There are 2 Packages for SPLOM Plots Psych and GGally.
      • Examples of these plots can be found in other chapters. - “ForLater” Add Links
    • The Problem with these plots is that screen space is limited, and as soon as the number of variables goes beyond 4, it becomes highly difficult to make sense of anything.
      • And it takes a long time to plot them.
      • These plots will only be included if these make sense.
    • There are different plots which would show only the correlation number and those are more efficient.

18.9 VIF

37.10 The variance inflation factor (VIF) is given by \(\text{VIF}_i = \frac{1}{1 - R_i^2} \in [1, \infty)\). That is, the minimum value for VIF is 1, and is reached when \(x_i\) is completely uncorrelated with the remaining predictors.

  • If the multicollinearity is present then model performance decreases
    • We can check correlation or VIF
    • \(R_i^2 = 0.80 \to \text{VIF} = 5\); \(\text{VIF} \geq 5\) is taken as an indicator of moderate multicollinearity
    • \(R_i^2 = 0.90 \to \text{VIF} = 10\); \(\text{VIF} \geq 10\) is taken as an indicator of severe multicollinearity
    • (Not Shown Here) But if the model is created with reference level of ‘CNG’ in fuel, then Petrol and Diesel will show very high VIF and high correlation
      • Because, when a Car runs on Petrol it does not run on Diesel and when it runs on Diesel it does not run on Petrol.
      • The numbers of observations for these two levels are similar
      • Thus, it is better to convert the most frequent level (Diesel) as the reference level
# #vif() To check VIF of the Model. All values should be < 5 (desirable) or < 10 (recommended)
vif(mod_xsyw)
##        km       age  f_Petrol     f_CNG     f_LPG  s_Dealer s_mDealer    t_Auto      o_II     o_III 
##  1.468655  1.568248  1.201782  1.012618  1.012142  1.143033  1.031406  1.067177  1.269446  1.202164 
##    o_More    o_Test 
##  1.085093  1.024226
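The VIF definition can be verified by hand for one predictor: regress it on the remaining predictors and apply 1/(1 - R²); a sketch assuming train_xsyw and mod_xsyw from the Build chunk:

```r
# #VIF of 'age' by hand: auxiliary regression of age on all other predictors
aux_age <- lm(age ~ . - price, data = train_xsyw)
r2_age  <- summary(aux_age)$r.squared
1 / (1 - r2_age)   #Should reproduce vif(mod_xsyw)["age"] i.e. ~1.57
```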

18.10 Stepwise Regression

37.11 In stepwise regression, the regression model begins with no predictors, then the most significant predictor is entered into the model, followed by the next most significant predictor. At each stage, each predictor is tested whether it is still significant. The procedure continues until all significant predictors have been entered into the model, and no further predictors have been dropped. The resulting model is usually a good regression model, although it is not guaranteed to be the global optimum.

  • The stepwise procedure represents a modification of the forward selection procedure.
    • In Forward Selection, we start with no variables in the model, add the variable most highly correlated with Y, check for significance, and keep adding variables in decreasing order of correlation as long as the model remains significant.
    • In Backward Elimination, we start with all the variables in the model, select the variable with the smallest partial F-statistic, remove it if it is insignificant, and keep removing variables in increasing order of partial F-statistic as long as they remain insignificant.
  • There are 3 dummy variables (o_III, o_More, o_Test) which have high p-values i.e. are insignificant.
    • All were kept by “forward”
    • Only 1 of them (o_More) was dropped by “backward”
    • Only 1 of them (o_More) was dropped by “both”
    • “ForLater” Theoretically it is understandable that some insignificant variables were kept, because the algorithm runs differently from the simplistic p-value based approach. However, does it mean that elimination or retention of variables should NOT be done based on p-value
    • “ForLater” It has been observed that the selection of the reference level changes the model outcome, significance of dummy variables etc. Which Level should be chosen as Reference? Currently, I am going with the idea that the most frequent level should be the reference.
  • Question: What are 12 elements / 13 elements shown about these Models
    • Elements are Attributes of the Model, not the number of variables in the model
    • (Aside) Base Model has 12 attributes. Model returned by step() has 1 more attribute (anova)
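The three directions can be run side by side; a sketch assuming mod_xsyw and train_xsyw from the Build chunk (forward selection needs a null starting model and an explicit scope):

```r
# #Compare the variables retained by each stepwise direction
null_xsyw <- lm(price ~ 1, data = train_xsyw)
fwd <- step(null_xsyw, scope = formula(mod_xsyw), direction = "forward", trace = 0)
bwd <- step(mod_xsyw, direction = "backward", trace = 0)
bth <- step(mod_xsyw, direction = "both",     trace = 0)
lapply(list(forward = fwd, backward = bwd, both = bth), 
       function(m) names(coef(m)))  #Per the notes, backward/both drop o_More
```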

Model Stepwise

# #step() can provide Stepwise Regression # "forward" "backward" "both"
stp_xsyw <- step(mod_xsyw, direction = "backward", trace = 0)
#stp_xsyw
#summary(stp_xsyw)
# #It adds another attribute (anova) to the model and thus shows 13 attributes
names(stp_xsyw) #13
##  [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"       
##  [7] "qr"            "df.residual"   "xlevels"       "call"          "terms"         "model"        
## [13] "anova"
#
if(TRUE) f_pNum(summary(stp_xsyw)$coefficients) %>% as_tibble(rownames = "DummyParVsRef") %>% 
  rename(pVal = "Pr(>|t|)") %>% 
  mutate(pVal = ifelse(pVal < 0.001, 0, pVal), isSig = ifelse(pVal < 0.05, TRUE, FALSE))
## # A tibble: 12 x 6
##    DummyParVsRef   Estimate `Std. Error` `t value`   pVal isSig
##    <chr>              <dbl>        <dbl>     <dbl>  <dbl> <lgl>
##  1 (Intercept)    926100.       20006.       46.3  0      TRUE 
##  2 km                 -0.79         0.18     -4.45 0      TRUE 
##  3 age            -35276.        1977.      -17.8  0      TRUE 
##  4 f_Petrol      -282828.       15045.      -18.8  0      TRUE 
##  5 f_CNG         -293484.       72235.       -4.06 0      TRUE 
##  6 f_LPG         -239715.       89016.       -2.69 0.0071 TRUE 
##  7 s_Dealer        39581.       17328.        2.28 0.022  TRUE 
##  8 s_mDealer      223696.       46715.        4.79 0      TRUE 
##  9 t_Auto         830032.       23510.       35.3  0      TRUE 
## 10 o_II           -42724.       17312.       -2.47 0.014  TRUE 
## 11 o_III          -46045.       29294.       -1.57 0.12   FALSE
## 12 o_Test         175419.      113659.        1.54 0.12   FALSE
#
# #anova table attribute for the iterations performed (not present in Base Model)
stp_xsyw$anova
##       Step Df    Deviance Resid. Df      Resid. Dev      AIC
## 1          NA          NA      3458 565224319985017 89633.49
## 2 - o_More  1 32869849259      3459 565257189834275 89631.70

Model Original

#mod_xsyw
#summary(mod_xsyw)
# #Base Model has 12 attributes
names(mod_xsyw) #12
##  [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"       
##  [7] "qr"            "df.residual"   "xlevels"       "call"          "terms"         "model"
#
if(TRUE) f_pNum(summary(mod_xsyw)$coefficients) %>% as_tibble(rownames = "DummyParVsRef") %>% 
  rename(pVal = "Pr(>|t|)") %>% 
  mutate(pVal = ifelse(pVal < 0.001, 0, pVal), isSig = ifelse(pVal < 0.05, TRUE, FALSE))
## # A tibble: 13 x 6
##    DummyParVsRef   Estimate `Std. Error` `t value`   pVal isSig
##    <chr>              <dbl>        <dbl>     <dbl>  <dbl> <lgl>
##  1 (Intercept)    925260.       20095.       46.0  0      TRUE 
##  2 km                 -0.79         0.18     -4.43 0      TRUE 
##  3 age            -35084.        2023.      -17.3  0      TRUE 
##  4 f_Petrol      -282851.       15047.      -18.8  0      TRUE 
##  5 f_CNG         -292937.       72253.       -4.05 0      TRUE 
##  6 f_LPG         -239453.       89028.       -2.69 0.0072 TRUE 
##  7 s_Dealer        38973.       17383.        2.24 0.025  TRUE 
##  8 s_mDealer      223254.       46730.        4.78 0      TRUE 
##  9 t_Auto         830131.       23513.       35.3  0      TRUE 
## 10 o_II           -44199.       17623.       -2.51 0.012  TRUE 
## 11 o_III          -47941.       29600.       -1.62 0.11   FALSE
## 12 o_More         -24595.       54846.       -0.45 0.65   FALSE
## 13 o_Test         176423.      113694.        1.55 0.12   FALSE

18.11 Model Validation

  • For some data points the error is huge; otherwise the Models look OK.
    • Later, we will compare the RMSE of different algorithms to identify the best algorithm for the specific dataset
# #predict() on test dataset by Base Model and Stepwise corrected Model
#pred_mod_xsyw <- predict(mod_xsyw, test_xsyw)
res_mod_xsyw <- test_xsyw %>% mutate(CalY = predict(mod_xsyw, .), Y_Yc = price - CalY)
res_stp_xsyw <- test_xsyw %>% mutate(CalY = predict(stp_xsyw, .), Y_Yc = price - CalY)
#
res_w <- tibble(Model = res_mod_xsyw$Y_Yc, Step = res_stp_xsyw$Y_Yc) 
f_wl(res_w)
## [1] "res_l"
#
summary(res_w)
##      Model               Step         
##  Min.   :-1014209   Min.   :-1012979  
##  1st Qu.: -172383   1st Qu.: -172430  
##  Median :  -41009   Median :  -40885  
##  Mean   :    3711   Mean   :    3562  
##  3rd Qu.:   83432   3rd Qu.:   84611  
##  Max.   : 7609249   Max.   : 7609070
#
# #RMSE: Root Mean Squared Error for Both Models (Loss Function)
res_w %>% summarise(across(everything(), ~ sqrt(mean((.)^2))))
## # A tibble: 1 x 2
##     Model    Step
##     <dbl>   <dbl>
## 1 505785. 505707.
#
# #MAE: Mean Absolute Error (MAE)
res_w %>% summarise(across(everything(), ~ mean(abs(.))))
## # A tibble: 1 x 2
##     Model    Step
##     <dbl>   <dbl>
## 1 242240. 242295.

Figure 18.3 (B26P04) CarDekho: BoxPlot of Results

Validation


19 WIP (B27, Jan-09)

19.1 Overview

  • “Machine learning using linear regression”
    • “ForLater”

19.2 WIP

19.3 Normality

Non-Normal

19.4 Multicollinearity

19.5 Transformation

Normal now (the Prof got a p-value less than 0.05)

19.6 Glance

Validation


20 WIP (B28, Jan-16)

20.1 Overview

  • “Machine learning using linear regression”

20.2 WIP

Validation


21 WIP (B29, Jan-23)

21.1 Overview

  • “Ridge, Lasso and Elastic Net regressions”

21.2 WIP

Validation


22 WIP (B30, Jan-30)

22.1 Overview

  • “Decision Tree Algorithm”

22.2 Prep KC House

  • About: [21613, 21]
    • One Column has 2 NA
    • Date needs to be converted
    • Integer to Categorical conversion is needed
    • yr_renovated needs to be handled: 0 means no renovation - we can convert it to a Yes/No factor
    • sqft_basement - similarly a Yes/No factor
    • There are big-area houses without any bedroom or bathroom
    • A renovated house is NOT a new house.
    • Calculate Age = Date of Sale - Year Built
    • There are 8 houses with negative age, i.e. sold first and completed later
    • There are 430 houses with 0 age, i.e. sold in the same year as built
    • These can happen anyway
  • Question: Property prices are affected by location. Why are we removing lat/long?
    • We are already including location-related features like waterfront etc.
    • It would have been better if we had rural, urban, city-centre, and market-type categories
    • (Aside) I do not agree, especially for zipcodes. We could have identified clusters of zipcodes.
  • Question: Based on the description of sqft_living15, would this not cause a Multicollinearity issue?
    • “ForLater”
  • Question: Average Price over zipcode has clear distinctions
    • “ForLater”
  • Question: Why is the age not taken as of today?
    • The price is of the date the house was sold. Our analysis date does not change that price.
aa <- xxB26KC
#
names(aa)
bb <- aa %>% drop_na() %>% 
  select(-c(id, view, zipcode, lat, long)) %>% 
  mutate(Sold = year(date), Age = Sold - yr_built) %>% relocate(Age, Sold) %>% 
  mutate(isRenew = ifelse(yr_renovated == 0, 0, 1)) %>% relocate(isRenew) %>% 
  rename(Beds = bedrooms, Baths = bathrooms, sqLiv = sqft_living, sqLot = sqft_lot) %>% 
  select(-date, -Sold, -yr_renovated) %>% relocate(price)
if(FALSE) str(bb)
if(FALSE) summary(bb)
if(TRUE) head(bb)

22.3 Correlation Matrix

kc_zsyw <- bb %>% mutate(across(where(is.numeric), ~ as.vector(scale(.)))) 
f_wl(kc_zsyw)
hh <- cor(kc_zsyw)
cap_hh <- paste0("Correlation Matrix") 
f_pKblM(x = hh, caption = cap_hh, negPos = c(-0.5, 0.5), dig = 3, debug = TRUE)

22.4 Boxplot

# #IN: hh(Keys, Values), 
C34 <- hh %>% { ggplot(data = ., mapping = aes(x = Keys, y = Values, fill = Keys)) +
    geom_boxplot() +
    k_gglayer_box +
    scale_y_continuous(breaks = breaks_pretty()) + 
    coord_flip() +
    theme(legend.position = 'none') +
    labs(x = NULL, y = NULL, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C34)
rm(C34)
C34P01

22.5 Outliers

# #Remove anything beyond 2 SD (around the median) of Price
ii <- bb %>% filter(!(abs(price - median(price)) > 2*sd(price)))
# #Remove anything beyond 2.5 on the scaled (z-score) Price
jj <- bb %>% mutate(zPrice = scale(price)) %>% relocate(zPrice) %>% filter(between(zPrice, -2.5, +2.5))
ii <- ii %>% select(-sqft_basement)

index_ii <- sample(1:nrow(ii), .80*nrow(ii))
train <- ii[index_ii,]
test <- ii[-index_ii,]
#
str(train)
## run the linear regression model
model1 <- lm(price ~ ., data = train)
summary(model1)
# #Get Prediction
predicted <- predict(model1, newdata = test) # #predict on the test data
table1 <- data.frame(Actual = test$price, Predicted = predicted) # #table of actual vs predicted values
mape_test <- mean(abs(table1$Actual - table1$Predicted) / table1$Actual)
accuracy_test <- 1 - mape_test
accuracy_test
# #Train
#custom control parameters

custom <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
#
ridge <- train(price ~ ., train, method = "glmnet", 
                tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 1, length = 5)), 
               trControl = custom)
# #
names(ridge)
predicted_ridge <- predict(ridge, newdata = test) # #predict on the test data
table_ridge <- data.frame(Actual = test$price, Predicted = predicted_ridge) # #table of actual vs predicted values
str(table_ridge)
#
#

mape_ridge <- mean(abs(table_ridge$Actual - table_ridge$Predicted) / table_ridge$Actual)
accuracy_ridge <- 1 - mape_ridge
accuracy_ridge #73.3%

22.6 Lasso


custom <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
lasso <- train (price ~., train, method = "glmnet",
                tuneGrid = expand.grid(alpha = 1, lambda = seq(0.0001, 1, length = 5)), 
                trControl = custom)
#
predicted_lasso <- predict(lasso, newdata = test) # #predict on the test data
table_lasso <- data.frame(Actual = test$price, Predicted = predicted_lasso) # #table of actual vs predicted values
str(table_lasso)

mape_lasso <- mean(abs(table_lasso$Actual - table_lasso$Predicted) / table_lasso$Actual)
accuracy_lasso <- 1 - mape_lasso
accuracy_lasso #73.7%
# #Compare accuracies (accuracy_els is only available after the Elastic section)
accuracy_ridge
accuracy_test
#accuracy_els

22.7 Elastic

custom <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
elastic <- train (price ~ ., train, method = "glmnet", 
                  tuneGrid = expand.grid(alpha = seq(0, 1, length = 10), 
                                         lambda = seq(0.0001, 1, length = 5)), trControl = custom)

predicted_els <- predict(elastic, newdata = test) # #predict on the test data
table_els <- data.frame(Actual = test$price, Predicted = predicted_els) # #table of actual vs predicted values
str(table_els)
mape_els <- mean(abs(table_els$Actual - table_els$Predicted) / table_els$Actual)
accuracy_els <- 1 - mape_els
accuracy_els #73.7% 0.737696

22.8 New Dataset Description

DATASET_2: This dataset provides features (related to demographics and buying behaviour) that are very relevant to predicting an auto insurance company’s customer lifetime value (CLV). For more details about the features, see the dataset. Using the given dataset, take customer lifetime value as the target/dependent variable, use the relevant features provided in the data as the independent variables, and develop predictive models. Treating the given situation as a regression problem, run linear regression to predict the target variable. Use the results to provide the necessary recommendations.

Validation


WIP (B31, Feb-06)

22.9 Overview

22.10 Packages

if(FALSE){# #WARNING: Installation may take some time.
  install.packages("rpart", dependencies = TRUE)
  install.packages("rpart.plot", dependencies = TRUE)
}

22.11 Decision Trees

Data Car Dekho

Validation


23 Data and Statistics

Definitions and Exercises are from the Book (David R. Anderson 2018)

23.1 Overview

23.2 Introduction

Definition 23.1 Data are the facts and figures collected, analysed, and summarised for presentation and interpretation.
Definition 23.2 Elements are the entities on which data are collected. (Generally ROWS)
Definition 23.3 A variable is a characteristic of interest for the elements. (Generally COLUMNS)
Definition 23.4 The set of measurements obtained for a particular element is called an observation.

Hence, the total number of data items can be determined by multiplying the number of observations by the number of variables.
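As a minimal sketch of this count (the tiny data frame below is illustrative):

```r
# #Data items = observations x variables
df <- data.frame(element = c("A", "B", "C"), value = c(10, 20, 30))
nrow(df) * ncol(df) # #3 observations x 2 variables = 6 data items
```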

Definition 23.5 Statistics is the art and science of collecting, analysing, presenting, and interpreting data.

23.3 Scales of Measurement

Data collection requires one of the following scales of measurement: nominal, ordinal, interval, or ratio.
Definition 23.6 The scale of measurement determines the amount of information contained in the data and indicates the most appropriate data summarization and statistical analyses.
Definition 23.7 When the data for a variable consist of labels or names used to identify an attribute of the element, the scale of measurement is considered a nominal scale.

For example, Gender as Male or Female. In cases where the scale of measurement is nominal, a numerical code as well as a nonnumerical label may be used: for example, 1 denotes Male and 2 denotes Female. The scale of measurement is nominal even though the data appear as numerical values. Only the Mode can be calculated.

Definition 23.8 The scale of measurement for a variable is considered an ordinal scale if the data exhibit the properties of nominal data and in addition, the order or rank of the data is meaningful.

For example, Size as small, medium, large. Along with the labels, similar to nominal data, the data can also be ranked or ordered, which makes the measurement scale ordinal. Ordinal data can also be recorded by a numerical code. Median can be calculated but not the Mean.
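In R, both scales map naturally onto factors; the small vectors below are illustrative:

```r
# #Nominal: labels only; the Mode is the only meaningful summary
gender <- factor(c("Male", "Female", "Male", "Male"))
names(which.max(table(gender))) # #Mode: "Male"
# #Ordinal: order is meaningful; the Median is defined, the Mean is not
size <- factor(c("small", "large", "medium", "medium"),
               levels = c("small", "medium", "large"), ordered = TRUE)
levels(size)[median(as.integer(size))] # #Median: "medium"
```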

Definition 23.9 The scale of measurement for a variable is an interval scale if the data have all the properties of ordinal data and the interval between values is expressed in terms of a fixed unit of measure.

Interval data are always numerical. These can be ranked or ordered like ordinal. In addition, the differences between them are meaningful.

Definition 23.10 The scale of measurement for a variable is a ratio scale if the data have all the properties of interval data and the ratio of two values is meaningful.

Variables such as distance, height, weight, and time use the ratio scale of measurement. This scale requires that a zero value be included to indicate that nothing exists for the variable at the zero point. Mean can be calculated.

For example, consider the cost of an automobile. A zero value for the cost would indicate that the automobile has no cost and is free. In addition, if we compare the cost of 30,000 dollars for one automobile to the cost of 15,000 dollars for a second automobile, the ratio property shows that the first automobile is 30000/15000 = 2 times, or twice, the cost of the second automobile.

See Table 23.1 for more details.

23.3.1 Interval scale vs. Ratio scale

Interval scale is a measure of continuous quantitative data that has an arbitrary 0 reference point. This is contrasted with ratio-scaled data, which have a non-arbitrary 0 reference point. Ex: When we look at “profit” we see that negative profit does make sense to us. So although the 0 for “profit” is a meaningful reference point (just like 0 °C on the Celsius scale), it is arbitrary, because values below it exist. Therefore, profit is on an interval scale of measurement.

In an interval scale, you can take difference of two values. You may not be able to take ratios of two values. Ex: Temperature in Celsius. You can say that if temperatures of two places are 40 °C and 20 °C, then one is hotter than the other (taking difference). But you cannot say that first is twice as hot as the second (not allowed to take ratio).

In a ratio scale, you can take a ratio of two values. Ex: 40 kg is twice as heavy as 20 kg (taking ratios).

Also, “0” on ratio scale means the absence of that physical quantity. “0” on interval scale does not mean the same. 0 kg means the absence of weight. 0 °C does not mean absence of heat.
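The difference-vs-ratio point can be checked numerically; Celsius/Fahrenheit and kilograms/pounds are the illustrative units:

```r
# #Interval scale: differences survive a change of units, ratios do not
c_temps <- c(40, 20)              # #degrees Celsius
f_temps <- c_temps * 9/5 + 32     # #same temperatures in Fahrenheit: 104, 68
c_temps[1] / c_temps[2]           # #2, but ...
f_temps[1] / f_temps[2]           # #... ~1.53: "twice as hot" is not preserved
# #Ratio scale: 40 kg / 20 kg = 2, and the ratio stays 2 in pounds as well
```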

Table 23.1: (C01V01) Interval scale vs. Ratio scale

  • Variable property
    • Interval scale: Addition and subtraction.
    • Ratio scale: Multiplication and division, i.e. ratios can be calculated. Thus, you can leverage numbers on the scale against 0.
  • Absolute zero point
    • Interval scale: The zero point is arbitrary. For example, temperature can go below 0 °C into negative values.
    • Ratio scale: Has an absolute zero, or character of origin. Height and weight cannot be below zero.
  • Calculation
    • Interval scale: The arithmetic mean is calculated. Statistical dispersion permits range and standard deviation; the coefficient of variation is not permitted.
    • Ratio scale: The geometric or harmonic mean can also be calculated. Range and coefficient of variation are permitted for measuring statistical dispersion.
  • Measurement
    • Interval scale: Measures size and magnitude as multiple factors of a defined unit.
    • Ratio scale: Measures size and magnitude as a factor of one defined unit in terms of another.
  • Example
    • Interval scale: Temperature in Celsius, calendar years and time, profit.
    • Ratio scale: Quantities with an absolute zero, like age, weight, height, or sales.

23.4 Categorical and Quantitative Data

Definition 23.11 Data that can be grouped by specific categories are referred to as categorical data. Categorical data use either the nominal or ordinal scale of measurement.
Definition 23.12 Data that use numeric values to indicate ‘how much’ or ‘how many’ are referred to as quantitative data. Quantitative data are obtained using either the interval or ratio scale of measurement.

If the variable is categorical, the statistical analysis is limited. We can summarize categorical data by counting the number of observations in each category or by computing the proportion of the observations in each category. However, even when the categorical data are identified by a numerical code, arithmetic operations do not provide meaningful results.

Arithmetic operations provide meaningful results for quantitative variables. For example, quantitative data may be added and then divided by the number of observations to compute the average value.

Quantitative data may be discrete or continuous.

Definition 23.13 Quantitative data that measure ‘how many’ are discrete.
Definition 23.14 Quantitative data that measure ‘how much’ are continuous because no separation occurs between the possible data values.

23.5 Cross-Sectional and Time Series Data

Definition 23.15 Cross-sectional data are data collected at the same or approximately the same point in time.
Definition 23.16 Time-series data are data collected over several time periods.

23.6 Observational Study and Experiment

Definition 23.17 In an observational study we simply observe what is happening in a particular situation, record data on one or more variables of interest, and conduct a statistical analysis of the resulting data.
Definition 23.18 The key difference between an observational study and an experiment is that an experiment is conducted under controlled conditions.

As a result, the data obtained from a well-designed experiment can often provide more information as compared to the data obtained from existing sources or by conducting an observational study.

23.7 Caution

  1. Time and Cost - The cost of data acquisition and the subsequent statistical analysis should not exceed the savings generated by using the information to make a better decision.
  2. Data Acquisition Errors - An error in data acquisition occurs whenever the data value obtained is not equal to the true or actual value that would be obtained with a correct procedure. Ex: recording error, misinterpretation etc. Blindly using any data that happen to be available or using data that were acquired with little care can result in misleading information and bad decisions.

23.8 Descriptive Statistics

Definition 23.19 Most of the statistical information is summarized and presented in a form that is easy to understand. Such summaries of data, which may be tabular, graphical, or numerical, are referred to as descriptive statistics.

23.9 Population and Sample

Definition 23.20 A population is the set of all elements of interest in a particular study.
Definition 23.21 A sample is a subset of the population.
Definition 23.22 The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.

Refer Sample For More …

23.10 Difference between a population and a sample

The population is the set of entities under study.

  • For example, the mean height of men. (Population “men,” parameter of interest “height”)
    • We choose the population that we wish to study.
    • Typically it is impossible to survey/measure the entire population because not all members are observable.
    • If it is possible to enumerate the entire population it is often costly to do so and would take a great deal of time.

Instead, we could take a subset of this population called a sample and use this sample to draw inferences about the population under study, given some conditions.

  • It is an inference because there will be some uncertainty and inaccuracy involved in drawing conclusions about the population based upon a sample.
    • In Simple Random Sampling (SRS) each member of the population has an equal probability of being included in the sample, hence the term “random.” There are many other sampling methods e.g. stratified sampling, cluster sampling, etc.
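A minimal SRS sketch with sample(); the population of heights below is simulated for illustration only:

```r
# #Simple Random Sampling: each member has an equal inclusion probability
set.seed(42)
population <- rnorm(10000, mean = 175, sd = 7) # #hypothetical heights (cm)
srs <- sample(population, size = 100)          # #SRS without replacement
mean(srs)        # #sample statistic ...
mean(population) # #... used to infer the population parameter
```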

23.11 Statistical Inference

Definition 23.23 The process of conducting a survey to collect data for the entire population is called a census.
Definition 23.24 The process of conducting a survey to collect data for a sample is called a sample survey.
Definition 23.25 Statistics uses data from a sample to make estimates and test hypotheses about the characteristics of a population through a process referred to as statistical inference.

Whenever statisticians use a sample to estimate a population characteristic of interest, they usually provide a statement of the quality, or precision, associated with the estimate.

Inferential statistics are used for Hypothesis Testing.

  • It is often used to compare the differences between the treatment groups.
  • It uses measurements from the sample of subjects in the experiment to compare the treatment groups and make generalizations about the larger population of subjects.
  • Most inferential statistics are based on the principle that a test-statistic value is calculated on the basis of a particular formula.
    • That value along with the degrees of freedom, and the rejection criteria are used to determine whether differences exist between the treatment groups.
    • The larger the sample size, the more likely a statistic is to indicate that differences exist between the treatment groups.

The two most common types of Statistical Inference are -

  1. Confidence Intervals
    • To estimate a population parameter
  2. Test of Significance
    • To assess the evidence provided by data about some claim concerning a population
    • i.e. To compare observed data with a claim (Hypothesis)
    • The results of a significance test are expressed in terms of a probability that measures how well the data and the claim agree

Reasoning for Tests of Significance

  • Example: Is the sample mean \({\overline{x}}\) significantly different from population mean \({\mu}\)
  • To determine if two numbers are significantly different, a statistical test must be conducted to provide evidence
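Both types of inference can be read off a one-sample t-test; the simulated data and the hypothesised mean below are illustrative:

```r
# #H0: mu = 170, on simulated data
set.seed(1)
x  <- rnorm(50, mean = 172, sd = 6)
tt <- t.test(x, mu = 170)
tt$conf.int # #1. Confidence Interval for the population mean
tt$p.value  # #2. Test of Significance: a small p-value is evidence against H0
```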

23.12 Analytics

Definition 23.26 Analytics is the scientific process of transforming data into insight for making better decisions.

Analytics is used for data-driven or fact-based decision making, which is often seen as more objective than alternative approaches to decision making. The tools of analytics can aid decision making by creating insights from data, improving our ability to more accurately forecast for planning, helping us quantify risk, and yielding better alternatives through analysis.
Analytics is now generally thought to comprise three broad categories of techniques. These categories are descriptive analytics, predictive analytics, and prescriptive analytics.

Definition 23.27 Descriptive analytics encompasses the set of analytical techniques that describe what has happened in the past.

Examples of these types of techniques are data queries, reports, descriptive statistics, data visualization, data dash boards, and basic what-if spreadsheet models.

Definition 23.28 Predictive analytics consists of analytical techniques that use models constructed from past data to predict the future or to assess the impact of one variable on another.

Linear regression, time series analysis, and forecasting models fall into the category of predictive analytics. Simulation, which is the use of probability and statistical computer models to better understand risk, also falls under the category of predictive analytics.

Prescriptive analytics differs greatly from descriptive or predictive analytics. What distinguishes prescriptive analytics is that prescriptive models yield a best course of action to take. That is, the output of a prescriptive model is a best decision.

Definition 23.29 Prescriptive analytics is the set of analytical techniques that yield a best course of action.

Optimization models, which generate solutions that maximize or minimize some objective subject to a set of constraints, fall into the category of prescriptive models.

23.13 Big Data and Data Mining

Definition 23.30 Larger and more complex data sets are now often referred to as big data.

Big data is often characterised by three V's: volume refers to the amount of available data; velocity refers to the speed at which data are collected and processed; and variety refers to the different data types. The term data warehousing refers to the process of capturing, storing, and maintaining the data.

Definition 23.31 Data Mining deals with methods for developing useful decision-making information from large databases. It can be defined as the automated extraction of predictive information from (large) databases.

Data mining relies heavily on statistical methodology such as multiple regression, logistic regression, and correlation.

23.14 Exercises

  • Table: 23.2
  • Table: 23.3
  • Table: 23.4
    • Who appears to be the market-share leader, and how are the market shares changing over time?
      • Caution: Trend analysis should be done by linear regression, with cor(), lm(), etc.

R

Load Data

xxComputers <- f_getObject("xxComputers", "C01-Computers.csv", "971fb6096e4f71e8185d3327a9033a10")
xxCordless <- f_getObject("xxCordless", "C01-Cordless.csv", "9991f612fe44f1c890440bd238084679")

f_getObject()

f_getObject <- function(x_name, x_source, x_md = "") {
  # #Debugging
  a07bug <- FALSE
  # #Read the File or Object
  # #Ex: xxCars <- f_getObject("xxCars", "S16-cars2.csv", "30051fb47f65810f33cb992015b849cc")
  # #tools::md5sum("xx.csv") OR tools::md5sum(paste0(.z$XL, "xx", ".txt"))
  #
  # #Path to the File 
  loc_src <- paste0(.z$XL, x_source)
  # #Path to the Object
  loc_rds <- paste0(.z$XL, x_name, ".rds")
  #
  # #x_file[1] FILENAME & x_file[2] FILETYPE
  x_file <- strsplit(x_source, "[.]")[[1]]
  #
  if(all(x_md == tools::md5sum(loc_src),  file.exists(loc_rds),
        file.info(loc_src)$mtime < file.info(loc_rds)$mtime)) {
      # #Read RDS if (exists, newer than source, source not modified i.e. passes md5sum)
      if(a07bug) print("A07 Flag 01: Reading from RDS")
      return(readRDS(loc_rds))
  } else if(!file.exists(loc_src)){
      message("ERROR: File does not exist! : ", loc_src, "\n")
      stop()
  } else if(x_file[2] == "csv") {
      # #Read CSV as a Tibble
      # #col_double(), col_character(), col_logical(), col_integer()
      # #DATETIME (EXCEL) "YYYY-MM-DD HH:MM:SS" imported as "UTC"
      tbl <- read_csv(loc_src, show_col_types = FALSE)
      # #Remove Unnecessary Attributes
      attr(tbl, "spec") <- NULL
      attr(tbl, "problems") <- NULL
      # #Write Object as RDS
      saveRDS(tbl, loc_rds)
      # #Return Object
      if(a07bug) print("A07 Flag 02: Reading from Source and Saving as RDS")
      return(tbl)
  } else if(x_file[2] == "xlsx") {
      # #Read All Sheets of Excel in a list
      tbl <- lapply(excel_sheets(loc_src), read_excel, path = loc_src)
      # #Write Object as RDS
      saveRDS(tbl, loc_rds)
      # #Return Object
      return(tbl)
  } else {
      message("f_getObject(): UNKNOWN")
      stop()
  }
}

Transpose Tibble

bb <- tibble(Company = c("Hertz", "Dollar", "Avis"), 
              `2007` = c(327, 167, 204), `2008` = c(311, 140, 220),
              `2009` = c(286, 106, 300), `2010` = c(290, 108, 270))
# #Transpose Tibble: Note that the First Column Header is lost after Transpose
# #Longer
bb %>% pivot_longer(!Company, names_to = "Year", values_to = "Values")
## # A tibble: 12 x 3
##    Company Year  Values
##    <chr>   <chr>  <dbl>
##  1 Hertz   2007     327
##  2 Hertz   2008     311
##  3 Hertz   2009     286
##  4 Hertz   2010     290
##  5 Dollar  2007     167
##  6 Dollar  2008     140
##  7 Dollar  2009     106
##  8 Dollar  2010     108
##  9 Avis    2007     204
## 10 Avis    2008     220
## 11 Avis    2009     300
## 12 Avis    2010     270
# #Transpose
(ii <- bb %>% 
  pivot_longer(!Company, names_to = "Year", values_to = "Values") %>% 
  pivot_wider(names_from = Company, values_from = Values))
## # A tibble: 4 x 4
##   Year  Hertz Dollar  Avis
##   <chr> <dbl>  <dbl> <dbl>
## 1 2007    327    167   204
## 2 2008    311    140   220
## 3 2009    286    106   300
## 4 2010    290    108   270
# #Equivalent
stopifnot(identical(ii, 
                    bb %>% pivot_longer(-1) %>% 
                      pivot_wider(names_from = 1, values_from = value) %>% 
                      rename(., Year = name)))

Computers

Table 23.2: (C01T02) xxComputers
SN  tablet                  cost  os       display_inch  battery_hh  cpu
 1  Acer Iconia W510         599  Windows          10.1         8.5  Intel
 2  Amazon Kindle Fire HD    299  Android           8.9         9.0  TI OMAP
 3  Apple iPad 4             499  iOS               9.7        11.0  Apple
 4  HP Envy X2               860  Windows          11.6         8.0  Intel
 5  Lenovo ThinkPad Tablet   668  Windows          10.1        10.5  Intel
 6  Microsoft Surface Pro    899  Windows          10.6         4.0  Intel
 7  Motorola Droid XYboard   530  Android          10.1         9.0  TI OMAP
 8  Samsung Ativ Smart PC    590  Windows          11.6         7.0  Intel
 9  Samsung Galaxy Tab       525  Android          10.1        10.0  Nvidia
10  Sony Tablet S            360  Android           9.4         8.0  Nvidia

Mean

# #What is the average cost for the tablets #$582.90
cat(paste0("Avg. Cost for the tablets is = $", round(mean(bb$cost), digits = 1), "\n"))
## Avg. Cost for the tablets is = $582.9
#
# #Compare the average cost of tablets with different OS (Windows /Android) #$723.20 $428.5
(ii <- bb %>%
  group_by(os) %>%
  summarise(Mean = round(mean(cost), digits =1)) %>%
  arrange(desc(Mean)) %>% 
    mutate(Mean = paste0("$", Mean)))
## # A tibble: 3 x 2
##   os      Mean  
##   <chr>   <chr> 
## 1 Windows $723.2
## 2 iOS     $499  
## 3 Android $428.5
#
cat(paste0("Avg. Cost of Tablets with Windows OS is = ", 
  ii %>% filter(os == "Windows") %>% select(Mean), "\n"))
## Avg. Cost of Tablets with Windows OS is = $723.2

Percentage

# #What percentage of tablets use an Android operating system #40%
(ii <- bb %>%
  group_by(os) %>%
  summarise(PCT = n()) %>%
  mutate(PCT = 100 * PCT / sum(PCT)) %>% 
  arrange(desc(PCT)) %>% 
  mutate(PCT = paste0(PCT, "%")))
## # A tibble: 3 x 2
##   os      PCT  
##   <chr>   <chr>
## 1 Windows 50%  
## 2 Android 40%  
## 3 iOS     10%
#
cat(paste0("Android OS is used in ", 
  ii %>% filter(os == "Android") %>% select(PCT), " Tablets\n"))
## Android OS is used in 40% Tablets

Cordless

Table 23.3: (C01T03) xxCordless
SN brand model price overall_score voice_quality handset_on_base talk_time_hh
1 AT&T CL84100 60 73 Excellent Yes 7
2 AT&T TL92271 80 70 Very Good No 7
3 Panasonic 4773B 100 78 Very Good Yes 13
4 Panasonic 6592T 70 72 Very Good No 13
5 Uniden D2997 45 70 Very Good No 10
6 Uniden D1788 80 73 Very Good Yes 7
7 Vtech DS6521 60 72 Excellent No 7
8 Vtech CS6649 50 72 Very Good Yes 7
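As above, the summaries that follow use `bb` for this table; a minimal reconstruction as a tibble (a sketch, dropping the `SN` column):

```r
# #Reconstruct the xxCordless table (C01T03) as a tibble named bb
library(tibble)

bb <- tibble(
  brand = c("AT&T", "AT&T", "Panasonic", "Panasonic",
            "Uniden", "Uniden", "Vtech", "Vtech"),
  model = c("CL84100", "TL92271", "4773B", "6592T",
            "D2997", "D1788", "DS6521", "CS6649"),
  price = c(60, 80, 100, 70, 45, 80, 60, 50),
  overall_score = c(73, 70, 78, 72, 70, 73, 72, 72),
  voice_quality = c("Excellent", "Very Good", "Very Good", "Very Good",
                    "Very Good", "Very Good", "Excellent", "Very Good"),
  handset_on_base = c("Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"),
  talk_time_hh = c(7, 7, 13, 13, 10, 7, 7, 7))
```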

Mean

# #What is the average price for the cordless telephones 
cat(paste0("Avg. Price is = $", round(mean(bb$price), digits = 1), "\n"))
## Avg. Price is = $68.1
#
# #What is the average talk time for the cordless telephones
cat(paste0("Avg. Talk Time is = ", round(mean(bb$talk_time_hh), digits = 1), " Hours \n"))
## Avg. Talk Time is = 8.9 Hours

Percentage

# #What percentage of the cordless telephones have a voice quality of excellent 
(hh <- bb %>%
  group_by(voice_quality) %>%
  summarise(PCT = n()) %>%
  mutate(PCT = 100 * PCT / sum(PCT)) %>% 
    mutate(voice_quality = factor(voice_quality, 
                                  levels = c("Very Good", "Excellent"), ordered = TRUE)) %>% 
    arrange(desc(voice_quality)) %>% 
    mutate(PCT = paste0(PCT, "%")))
## # A tibble: 2 x 2
##   voice_quality PCT  
##   <ord>         <chr>
## 1 Excellent     25%  
## 2 Very Good     75%
#
cat(paste0("Percentage of 'Excellent' Voice Quality is = ", 
  hh %>% filter(voice_quality == "Excellent") %>% select(PCT), "\n"))
## Percentage of 'Excellent' Voice Quality is = 25%
#
# #Equivalent
print(bb %>%
 group_by(voice_quality) %>%
 summarise(PCT = n()) %>%
 mutate(PCT = prop.table(PCT) * 100))
## # A tibble: 2 x 2
##   voice_quality   PCT
##   <chr>         <dbl>
## 1 Excellent        25
## 2 Very Good        75

PCT 2

# #What percentage of the cordless telephones have a handset on the base 
bb %>%
  group_by(handset_on_base) %>%
  summarise(PCT = n()) %>%
  mutate(PCT = 100 * PCT / sum(PCT)) %>% 
  arrange(desc(PCT)) %>% 
  mutate(PCT = paste0(PCT, "%")) %>%
  filter(handset_on_base == "Yes") 
## # A tibble: 1 x 2
##   handset_on_base PCT  
##   <chr>           <chr>
## 1 Yes             50%

Cars

Transform

Table 23.4: (C01T04) Cars in Service
Company 2007 2008 2009 2010
Hertz 327 311 286 290
Dollar 167 140 106 108
Avis 204 220 300 270
Table 23.4: (C01T04B) Cars (Transposed)
Year Hertz Dollar Avis
2007 327 167 204
2008 311 140 220
2009 286 106 300
2010 290 108 270
bb <- tibble(Company = c("Hertz", "Dollar", "Avis"), 
              `2007` = c(327, 167, 204), `2008` = c(311, 140, 220),
              `2009` = c(286, 106, 300), `2010` = c(290, 108, 270))
# #Transpose Tibble: Note that the First Column Header is lost after Transpose
# #Longer
hh <- bb %>% pivot_longer(!Company, names_to = "Year", values_to = "Values")
# #Transpose
ii <- bb %>% 
  pivot_longer(!Company, names_to = "Year", values_to = "Values") %>% 
  pivot_wider(names_from = Company, values_from = Values)

TimeSeries

loc_png <- paste0(.z$PX, "C01P01", "-Cars-TimeSeries", ".png")
# #Load an Image
knitr::include_graphics(paste0(.z$PX, "C01P01", "-Cars-TimeSeries", ".png"))

Figure 23.1 (C01P01) Multiple Time Series Graph

Rowwise

# #who appears to be the market share leader
# #how the market shares are changing over time
print(ii)
## # A tibble: 4 x 4
##   Year  Hertz Dollar  Avis
##   <chr> <dbl>  <dbl> <dbl>
## 1 2007    327    167   204
## 2 2008    311    140   220
## 3 2009    286    106   300
## 4 2010    290    108   270
# #Row Total
jj <- ii %>% rowwise() %>% mutate(SUM = sum(c_across(where(is.numeric)))) %>% ungroup()
kk <- ii %>% mutate(SUM = rowSums(across(where(is.numeric))))
stopifnot(identical(jj, kk))
#
# #Rowwise Percentage Share 
ii %>% 
  rowwise() %>% 
  mutate(SUM = sum(c_across(where(is.numeric)))) %>% 
  ungroup() %>%
  mutate(across(2:4, ~ round(. * 100 / SUM, digits = 1), .names = "{.col}.{.fn}")) %>%
  mutate(across(ends_with(".1"), ~ paste0(., "%")))
## # A tibble: 4 x 8
##   Year  Hertz Dollar  Avis   SUM Hertz.1 Dollar.1 Avis.1
##   <chr> <dbl>  <dbl> <dbl> <dbl> <chr>   <chr>    <chr> 
## 1 2007    327    167   204   698 46.8%   23.9%    29.2% 
## 2 2008    311    140   220   671 46.3%   20.9%    32.8% 
## 3 2009    286    106   300   692 41.3%   15.3%    43.4% 
## 4 2010    290    108   270   668 43.4%   16.2%    40.4%

Pareto

# #Bar Plot
aa <- bb %>% 
  select(Company, `2010`) %>% 
  rename("Y2010" = `2010`) %>% 
  arrange(desc(.[2])) %>% 
  mutate(cSUM = cumsum(Y2010)) %>%
  mutate(PCT = 100 * Y2010 / sum(Y2010)) %>% 
  mutate(cPCT = 100 * cumsum(Y2010) / sum(Y2010)) %>% 
  mutate(across(Company, factor, levels = unique(Company), ordered = TRUE))
# #
pareto_chr <- setNames(c(aa$Y2010), aa$Company)
stopifnot(identical(pareto_chr, aa %>% pull(Y2010, Company)))
stopifnot(identical(pareto_chr, aa %>% select(1:2) %>% deframe()))
# #Save without using ggsave()
hh <- pareto_chr
loc_png <- paste0(.z$PX, "C01P02", "-Cars-Pareto", ".png")
cap_hh <- "C01P02"
#
if(!file.exists(loc_png)) {
  png(filename = loc_png) 
  #dev.control('enable') 
  pareto.chart(hh, xlab = "Company", ylab = "Cars", cumperc = seq(0, 100, by = 20),  
               ylab2 = "Cumulative Percentage", main = "Pareto Chart")  
  #title(main = ttl_hh, line = 2, adj = 0)
  title(sub = cap_hh, line = 4, adj = 1)
  C01P02 <- recordPlot()
  dev.off()
}

Figure 23.2 (C01P02) Pareto of Cars in 2010

Validation

# #Summarised Packages and Objects
f_()
## [1] ""
#
difftime(Sys.time(), k_start)
## Time difference of 59.39836 secs

24 Descriptive Statistics

24.1 Overview

24.2 Summarizing Data for a Categorical Variable

Definition 24.1 A frequency distribution is a tabular summary of data showing the number (frequency) of observations in each of several non-overlapping categories or classes.

The relative frequency of a class is the fraction or proportion of observations belonging to that class, so relative frequencies sum to 1, whereas the ‘percent frequency’ is the relative frequency expressed as a percentage, summing to 100%.

Rather than showing the frequency of each class, the cumulative frequency distribution shows the number of data items with values less than or equal to the upper class limit of each class.
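The distributions defined above can be computed with base R alone; a sketch using a small hypothetical quantitative sample grouped into classes:

```r
# #Frequency, relative, percent, and cumulative frequency for a small
# #hypothetical sample grouped into three non-overlapping classes
x <- c(12, 15, 21, 8, 30, 18, 25, 11)
bins <- cut(x, breaks = c(0, 10, 20, 30))

freq  <- table(bins)          # frequency distribution
rel   <- prop.table(freq)     # relative frequency (sums to 1)
pct   <- 100 * rel            # percent frequency (sums to 100)
cfreq <- cumsum(freq)         # cumulative frequency (items <= upper class limit)

print(freq)    # (0,10] = 1, (10,20] = 4, (20,30] = 3
print(cfreq)   # 1 5 8
```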

  • Bar Chart
    • Pareto Chart - ggplot() does not allow easy setup of a dual axis
    • Stacked Bar Chart - do not use it if there are more than 2 categories
  • Pie Chart
    • Only use it if the total is 100% and there are fewer than 5 or 6 categories.

Bar & Pie

Table 24.1: (C02T02) Frequency Distribution
softdrink Frequency cSUM PROP PCT cPCT
Coca-Cola 19 19 38 38% 38%
Pepsi 13 32 26 26% 64%
Diet Coke 8 40 16 16% 80%
Dr. Pepper 5 45 10 10% 90%
Sprite 5 50 10 10% 100%

Figure 24.1 (C02P01 C02P02) Bar Chart and Pie Chart of Frequency

Data

# #Frequency Distribution
aa <- tibble(softdrink = c("Coca-Cola", "Diet Coke", "Dr. Pepper", "Pepsi", "Sprite"), 
             Frequency = c(19, 8, 5, 13, 5))
#
# #Sort, Cumulative Sum, Percentage, and Cumulative Percentage
bb <- aa %>% 
  arrange(desc(Frequency)) %>% 
  mutate(cSUM = cumsum(Frequency)) %>%
  mutate(PROP = 100 * Frequency / sum(Frequency)) %>% 
  mutate(PCT = paste0(PROP, "%")) %>% 
  mutate(cPCT = paste0(100 * cumsum(Frequency) / sum(Frequency), "%"))

Bar

# #Sorted Bar Chart of Frequencies (Needs x-axis as Factor for proper sorting)
C02P01 <- bb %>% mutate(across(softdrink, factor, levels = rev(unique(softdrink)))) %>% {
  ggplot(data = ., aes(x = softdrink, y = Frequency)) +
  geom_bar(stat = 'identity', aes(fill = softdrink)) + 
  scale_y_continuous(sec.axis = sec_axis(~ (. / sum(bb$Frequency))*100, name = "Percentages", 
                       labels = function(b) { paste0(round(b, 0), "%")})) +
  geom_text(aes(label = paste0(Frequency, "\n(", PCT, ")")), vjust = 2, 
            colour = c(rep("black", 2), rep("white", nrow(bb)-2))) +
  k_gglayer_bar +   
  labs(x = "Soft Drinks", y = "Frequency", subtitle = NULL, 
         caption = "C02P01", title = "Bar Chart of Categorical Data")
}

Pie

# #Pie Chart of Frequencies (Needs x-axis as Factor for proper sorting)
C02P02 <- bb %>% mutate(across(softdrink, factor, levels = unique(softdrink))) %>% {
  ggplot(data = ., aes(x = '', y = Frequency, fill = rev(softdrink))) +
  geom_bar(stat = 'identity', width = 1, color = "white") +
  coord_polar(theta = "y", start = 0) +
  geom_text(aes(label = paste0(softdrink, "\n", Frequency, " (", PCT, ")")), 
            position = position_stack(vjust = 0.5), 
            colour = c(rep("black", 2), rep("white", nrow(bb)-2))) +
  k_gglayer_pie +   
  labs(caption = "C02P02", title = "Pie Chart of Categorical Data")
}

f_theme_gg()

f_theme_gg <- function(base_size = 14) {
# #Create a Default Theme 
  theme_bw(base_size = base_size) %+replace%
    theme(
      # #The whole figure
      plot.title = element_text(size = rel(1), face = "bold", 
                                margin = margin(0,0,5,0), hjust = 0),
      # #Area where the graph is located
      panel.grid.minor = element_blank(),
      panel.border = element_blank(),
      # #The axes
      axis.title = element_text(size = rel(0.85), face = "bold"),
      axis.text = element_text(size = rel(0.70), face = "bold"),
#      arrow = arrow(length = unit(0.3, "lines"), type = "closed"),
      axis.line = element_line(color = "black"),
      # The legend
      legend.title = element_text(size = rel(0.85), face = "bold"),
      legend.text = element_text(size = rel(0.70), face = "bold"),
      legend.key = element_rect(fill = "transparent", colour = NA),
      legend.key.size = unit(1.5, "lines"),
      legend.background = element_rect(fill = "transparent", colour = NA),
      # Labels in the case of facetting
      strip.background = element_rect(fill = "#17252D", color = "#17252D"),
      strip.text = element_text(size = rel(0.85), face = "bold", color = "white", margin = margin(5,0,5,0))
    )
}
# #Change default ggplot2 theme 
theme_set(f_theme_gg()) 
#
# #List of Specific sets. Note '+' is replaced by ','
k_gglayer_bar <- list(
  scale_fill_viridis_d(),
  theme(panel.grid.major.x = element_blank(), axis.line = element_blank(),
        panel.border = element_rect(colour = "black", fill=NA, size=1),
        legend.position = 'none', axis.title.y.right = element_blank())
)
#
# #Pie
k_gglayer_pie <- list(
  scale_fill_viridis_d(),
  #theme_void(),
  theme(#panel.background = element_rect(fill = "white", colour = "white"),
        #plot.background = element_rect(fill = "white",colour = "white"),
        axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        #panel.border = element_rect(colour = "black", fill=NA, size=1),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = 'none')
)
#
# #Histogram
k_gglayer_hist <- list(
  scale_fill_viridis_c(direction = -1, alpha = 0.9),
  theme(panel.grid.major.x = element_blank(), axis.line.y = element_blank(),
        panel.border = element_blank(), axis.ticks.y = element_blank(), 
        legend.position = 'none')
)
#
# #Scatter Plot Trendline
k_gglayer_scatter <- list(
  scale_fill_viridis_d(alpha = 0.9),
  theme(panel.grid.minor = element_blank(),
        panel.border = element_blank())
)
#
# #BoxPlot
k_gglayer_box <- list(
  scale_fill_viridis_d(alpha = 0.9),
  theme(panel.grid.major = element_line(colour = "#d3d3d3"),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(), panel.grid.major.x = element_blank(),
        #plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
        #text=element_text(family = "Tahoma"),
        #axis.title = element_text(face="bold"),
        #axis.text.x = element_text(colour="black", size = 11),
        #axis.text.y = element_text(colour="black", size = 9),
        axis.line = element_line(size=0.5, colour = "black"))
)
#

Errors

ERROR 24.1 Error: stat_count() can only have an x or y aesthetic.

Solution: Use geom_bar(stat = "identity")
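A minimal sketch of the error and the fix, with a hypothetical pre-summarised data frame `df` (assuming ggplot2 is attached):

```r
library(ggplot2)

# #Hypothetical data that is already counted (one y value per category)
df <- data.frame(class = c("A", "B"), n = c(3, 5))

# #Fails: the default stat = "count" accepts only one of x or y
# ggplot(df, aes(x = class, y = n)) + geom_bar()

# #Works: stat = "identity" plots the supplied y values as-is
p <- ggplot(df, aes(x = class, y = n)) + geom_bar(stat = "identity")

# #geom_col() is the shorthand for geom_bar(stat = "identity")
p2 <- ggplot(df, aes(x = class, y = n)) + geom_col()
```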

24.3 Summarizing Data for a Quantitative Variable

A histogram is used for continuous data, where the bins represent ranges of values, while a bar chart plots categorical variables.

The three steps necessary to define the classes for a frequency distribution with quantitative data are

  1. Determine the number of nonoverlapping classes (Bins)
    • Classes are formed by specifying ranges that will be used to group the data.
    • Approx. 5-20
  2. Determine the width of each class
    • The bins are usually specified as consecutive, non-overlapping intervals of a variable.
    • The bins (intervals) must be adjacent and are often (but not required to be) of equal size.
    • Approx. Bin Width = (Max - Min) / Number of Bins
    • Ex: For a dataset with min = 12 & max = 33, 5 bins of 10-14, …, 30-34 can be selected
  3. Determine the class
    • Class limits must be chosen so that each data item belongs to one and only one class
    • For categorical data, this was not required because each item naturally fell into a separate class
    • But with quantitative data, class limits are necessary to determine where each data value belongs
    • The ‘class midpoint’ is the value halfway between the lower and upper class limits. For a Bin of 10-14, 12 will be its mid-point.
  • Dot Plot
    • A horizontal axis shows the range for the data. Each data value is represented by a dot placed above the axis.
    • Caution: Avoid! Y-Axis is deceptive.
  • Histogram
    • Unlike a bar graph, a histogram contains no natural separation between the rectangles of adjacent classes.
  • Stem-and-Leaf Display (Not useful)
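The three steps can be sketched in base R for the example above (the choice of 5 classes, the hypothetical data, and the rounding of the class width are assumptions following the text):

```r
# #Steps 1-3 for the example in the text: min = 12, max = 33 (hypothetical data)
x <- c(12, 14, 19, 21, 23, 26, 28, 30, 31, 33)

k <- 5                                         # 1. number of classes (approx. 5-20)
width <- ceiling((max(x) - min(x)) / k)        # 2. class width: (33 - 12) / 5 = 4.2 -> 5
breaks <- seq(10, 10 + k * width, by = width)  # 3. class limits 10, 15, ..., 35
                                               #    (10 = min rounded down to the width)
mids <- head(breaks, -1) + (width - 1) / 2     #    class midpoints 12, 17, 22, 27, 32
table(cut(x, breaks, right = FALSE))           # each item in one and only one class
```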

Histogram

set.seed(3)
# #Get Normal Data
bb <- tibble(aa = rnorm(n = 10000)) 
# #Histogram
# # '..count..' or '..x..'
C02P03 <- bb %>% {
  ggplot(data = ., aes(x = aa, fill = ..count..)) + 
  geom_histogram(bins = 50, position = "identity") +    
  k_gglayer_hist +
  labs(x = "Normal Data", y = "Count", subtitle = paste0("n = ", format(nrow(.), big.mark = ",")), 
       caption = "C02P03", title = "Histogram")
}

Figure 24.2 (C02P03) geom_histogram(): Histogram

Dot Plot

# #Random Data
aa <- c(26, 35, 22, 47, 37, 5, 50, 49, 42, 2, 8, 7, 4, 47, 44, 35, 17, 49, 1, 48, 
        1, 27, 13, 26, 18, 44, 31, 4, 23, 47, 38, 28, 28, 5, 35, 39, 29, 13, 17, 
        38, 1, 8, 3, 30, 18, 37, 29, 39, 7, 28)
bb <- tibble(aa)
# #Dot Chart of Frequencies
C02P04 <- bb %>% {
  ggplot(., aes(x = aa)) +
  geom_dotplot(binwidth = 5, method = "histodot") + 
  theme(axis.line.y = element_blank(), panel.grid = element_blank(), axis.text.y = element_blank(),
        axis.ticks.y = element_blank(), axis.title.y =  element_blank()) + 
  labs(x = "Bins", subtitle = "Caution: Avoid! Y-Axis is deceptive.", 
       caption = "C02P04", title = "Dot Plot")
}

Figure 24.3 (C02P04) geom_dotplot(): Frequency Dot Chart

Get Frequency

24.4 Summarizing Data for Two Variables Using Tables

Definition 24.2 A crosstabulation is a tabular summary of data for two variables. It is used to investigate the relationship between them. Generally, one of the variables is categorical.
  • Simpson's Paradox
    • The reversal of conclusions based on aggregated and unaggregated data is called Simpson's Paradox.
    • Ex: Table 24.2 shows the count of judgements that were ‘upheld’ or ‘reversed’ on appeal for two judges
      • 86% of the verdicts were upheld for Judge Abel, while 88% of the verdicts were upheld for Judge Ken. From this aggregated crosstabulation, we would conclude that Judge Ken is doing the better job because a greater percentage of his verdicts are being upheld.
      • However, the unaggregated crosstabulations show that in both types of courts (Common, Municipal) Judge Abel has a higher percentage of ‘Upheld’ verdicts (90.6% and 84.7%) compared to Judge Ken (90% and 80%)
      • Thus, Abel has the better record because a greater percentage of his verdicts are being upheld in both courts.
      • This reversal of conclusions based on aggregated and unaggregated data illustrates Simpson's Paradox.
    • Cause
      • Note that for both judges the percentage of appeals that resulted in reversals was much higher in ‘Municipal’ than in ‘Common’ Court, i.e. 15.3% vs. 9.4% for Abel and 20% vs. 10% for Ken.
      • Because Judge Abel tried a much higher fraction of his cases in ‘Municipal’ (118/150 for Abel vs. 25/125 for Ken), the aggregated data favoured Judge Ken.
      • Thus, for the original crosstabulation, we see that the ‘type of court’ is a hidden variable that cannot be ignored when evaluating the records of the two judges.
Table 24.2: (C02T01) Both Judges
Judge_Verdict xUpheld xReversed SUM
Abel 129 (86%) 21 (14%) 150
Ken 110 (88%) 15 (12%) 125
Total 239 (86.9%) 36 (13.1%) 275
Table 24.2: (C02T01A) Abel
Abel xUpheld xReversed SUM
Common 29 (90.6%) 3 (9.4%) 32
Municipal 100 (84.7%) 18 (15.3%) 118
Total 129 (86%) 21 (14%) 150
Table 24.2: (C02T01B) Ken
Ken xUpheld xReversed SUM
Common 90 (90%) 10 (10%) 100
Municipal 20 (80%) 5 (20%) 25
Total 110 (88%) 15 (12%) 125

Judges

# #Judges: Because we are evaluating 'Judges', they are the 'elements' and thus are the 'rows'
xxJudges <- tibble(Judge_Verdict = c('Abel', 'Ken'), Upheld = c(129, 110), Reversed = c(21, 15))
# #Unaggregated crosstabs for both Judges in the two types of Courts
xxKen <- tibble(Ken = c("Common", "Municipal"), 
                    Upheld = c(90, 20), Reversed = c(10, 5))
xxAbel <- tibble(Abel = c("Common", "Municipal"), 
                    Upheld = c(29, 100), Reversed = c(3, 18))

Transpose

# #Judges
aa <- tibble(Judge_Verdict = c('Abel', 'Ken'), Upheld = c(129, 110), Reversed = c(21, 15))
bb <- tibble(Verdict_Judge = c('Upheld', 'Reversed'), Abel = c(129, 21), Ken = c(110, 15))
aa
## # A tibble: 2 x 3
##   Judge_Verdict Upheld Reversed
##   <chr>          <dbl>    <dbl>
## 1 Abel             129       21
## 2 Ken              110       15
# #Transpose, Assuming First Column Header has "Row_Col" Type Format
ii <- aa %>% 
  `attr<-`("ColsLost", unlist(strsplit(names(.)[1], "_"))[1]) %>% 
  `attr<-`("RowsKept", unlist(strsplit(names(.)[1], "_"))[2]) %>% 
  pivot_longer(c(-1), 
               names_to = paste0(attributes(.)$RowsKept, "_", attributes(.)$ColsLost), 
               values_to = "Values") %>% 
  pivot_wider(names_from = 1, values_from = Values) %>% 
  `attr<-`("ColsLost", NULL) %>% `attr<-`("RowsKept", NULL) 
stopifnot(identical(bb, ii))
ii
## # A tibble: 2 x 3
##   Verdict_Judge  Abel   Ken
##   <chr>         <dbl> <dbl>
## 1 Upheld          129   110
## 2 Reversed         21    15
# #Testing for Reverse
ii <- bb %>% 
  `attr<-`("ColsLost", unlist(strsplit(names(.)[1], "_"))[1]) %>% 
  `attr<-`("RowsKept", unlist(strsplit(names(.)[1], "_"))[2]) %>% 
  pivot_longer(c(-1), 
               names_to = paste0(attributes(.)$RowsKept, "_", attributes(.)$ColsLost), 
               values_to = "Values") %>% 
  pivot_wider(names_from = 1, values_from = Values) %>% 
  `attr<-`("ColsLost", NULL) %>% `attr<-`("RowsKept", NULL) 
stopifnot(identical(aa, ii))

String Split

bb <- "Judge_Verdict"
# #Split String by strsplit(), output is list
(ii <- unlist(strsplit(bb, "_")))
## [1] "Judge"   "Verdict"
#
# #Split on Dot 
bb <- "Judge.Verdict"
# #Using character classes
ii <- unlist(strsplit(bb, "[.]"))
# #By escaping special characters
jj <- unlist(strsplit(bb, "\\."))
# #Using Options
kk <- unlist(strsplit(bb, ".", fixed = TRUE))
stopifnot(all(identical(ii, jj), identical(ii, kk)))

Attributes

  • Tibble
    • ‘problems’ attribute contains List of All Problems
      • problems(bb)
    • ‘spec’ attribute contains List of Columns and Types
      • spec(bb)
      • Caution: it SHOWS a Snapshot at Import, NOT the current Status; Better To Be Removed
jj <- ii <- bb <- aa
# #attr() adds or removes an attribute
attr(bb, "NewOne") <- "abc"
# #Using Backticks
ii <- `attr<-`(ii, "NewOne", "abc")
# #Using Pipe
jj <- jj %>% `attr<-`("NewOne", "abc")
#
stopifnot(all(identical(bb, ii), identical(bb, jj)))
#
# #List Attributes
names(attributes(bb))
## [1] "class"     "row.names" "names"     "NewOne"
#
# #Specific Attribute Value
attributes(bb)$NewOne
## [1] "abc"
#
# #Remove Attributes
attr(bb, "NewOne") <- NULL
ii <- `attr<-`(ii, "NewOne", NULL)
jj <- jj %>% `attr<-`("NewOne", NULL)
stopifnot(all(identical(bb, ii), identical(bb, jj)))

Total Row

# #(Deprecated) Issues: 
# #(1) bind_rows() needs two dataframes. Thus, the first can be skipped in a Pipe, but...
# #the second dataframe cannot be replaced with dot (.), it has to have a name
# #(2) Pipe usage inside a function call was working but was a concern
# #(3) It introduced NA, so replace() was needed as another step
ii <- aa %>% bind_rows(aa %>% summarise(across(where(is.numeric), sum))) %>%
    mutate(across(1, ~ replace(., . %in% NA, "Total"))) 
#
# #(Deprecated) Works but needs ALL Column Names individually
jj <- aa %>% add_row(Judge_Verdict = "Total", Upheld = sum(.[ , 2]), Reversed = sum(.[ , 3]))
kk <- aa %>% add_row(Judge_Verdict = "Total", Upheld = sum(.$Upheld), Reversed = sum(.$Reversed))
#
# #(Deprecated) Removed the Multiple call to sum(). However, it needs First Column Header Name
ll <- aa %>% add_row(Judge_Verdict = "Total", summarise(., across(where(is.numeric), sum)))
# #(Deprecated) Replaced Column Header Name using "Tilde"
mm <- aa %>% add_row(summarise(., across(where(is.character), ~"Total")), 
               summarise(., across(where(is.numeric), sum)))
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll), identical(ii, mm)))
#
# #(Working): Minimised
aa %>% add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.numeric), sum)))
## # A tibble: 3 x 3
##   Judge_Verdict Upheld Reversed
##   <chr>          <dbl>    <dbl>
## 1 Abel             129       21
## 2 Ken              110       15
## 3 Total            239       36

Replace NA

# #Use '%in%' or is.na() to match NA; '==' returns NA when comparing with NA
bb <- aa %>% bind_rows(aa %>% summarise(across(where(is.numeric), sum)))
#
ii <- bb %>% mutate(across(1, ~ replace(., . %in% NA, "Total"))) 
mm <- bb %>% mutate(across(1, ~ replace(., is.na(.), "Total"))) 
jj <- bb %>% mutate(Judge_Verdict = replace(Judge_Verdict, is.na(Judge_Verdict), "Total"))
kk <- bb %>% mutate(across(1, coalesce, "Total")) 
ll <- bb %>% mutate(across(1, ~ replace_na(.x, "Total")))
nn <- bb %>% mutate(across(1, replace_na, "Total"))
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll), 
              identical(ii, mm), identical(ii, nn)))
#
#   #Replace NA in a Factor
bb %>% 
  mutate(Judge_Verdict = factor(Judge_Verdict)) %>% 
  mutate(across(1, fct_explicit_na, na_level = "Total"))
## # A tibble: 3 x 3
##   Judge_Verdict Upheld Reversed
##   <fct>          <dbl>    <dbl>
## 1 Abel             129       21
## 2 Ken              110       15
## 3 Total            239       36

To Factor

#   #Convert to Factor
aa %>% mutate(Judge_Verdict = factor(Judge_Verdict))
## # A tibble: 2 x 3
##   Judge_Verdict Upheld Reversed
##   <fct>          <dbl>    <dbl>
## 1 Abel             129       21
## 2 Ken              110       15

Clipboard

# #Paste but do not execute
aa <- read_delim(clipboard())
# #Copy Excel Data, then execute the above command
#
# #Print its structure
dput(aa)
# #Copy the relevant values, headers in tibble()
bb <- tibble(  )
# #The above command will be the setup to generate this tibble anywhere

24.5 Exercise

C02E27

Data

ex27 <- tibble(Observation = 1:30, 
             x = c("A", "B", "B", "C", "B", "C", "B", "C", "A", "B", "A", "B", "C", "C", "C", 
                   "B", "C", "B", "C", "B", "C", "B", "C", "A", "B", "C", "C", "A", "B", "B"), 
             y = c(1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 
                   2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 2))

CrossTab

bb <- ex27
str(bb)
## tibble [30 x 3] (S3: tbl_df/tbl/data.frame)
##  $ Observation: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
##  $ x          : chr [1:30] "A" "B" "B" "C" ...
##  $ y          : num [1:30] 1 1 1 2 1 2 1 2 1 1 ...
# #Create CrossTab
bb <- bb %>% 
  count(x, y) %>% 
  pivot_wider(names_from = y, values_from = n, values_fill = 0)

PCT

bb
## # A tibble: 3 x 3
##   x       `1`   `2`
##   <chr> <int> <int>
## 1 A         5     0
## 2 B        11     2
## 3 C         2    10
# #Rowwise Percentage in Separate New Columns
bb %>% 
  mutate(SUM = rowSums(across(where(is.numeric)))) %>% 
  mutate(across(where(is.numeric), ~ round(. * 100 /SUM, 1), .names = "{.col}_Row" )) 
## # A tibble: 3 x 7
##   x       `1`   `2`   SUM `1_Row` `2_Row` SUM_Row
##   <chr> <int> <int> <dbl>   <dbl>   <dbl>   <dbl>
## 1 A         5     0     5   100       0       100
## 2 B        11     2    13    84.6    15.4     100
## 3 C         2    10    12    16.7    83.3     100
#
# #Rowwise Percentage in Same Columns
bb %>% 
  mutate(SUM = rowSums(across(where(is.numeric)))) %>% 
  mutate(across(where(is.numeric), ~ round(. * 100 /SUM, 1))) 
## # A tibble: 3 x 4
##   x       `1`   `2`   SUM
##   <chr> <dbl> <dbl> <dbl>
## 1 A     100     0     100
## 2 B      84.6  15.4   100
## 3 C      16.7  83.3   100
#
# #Equivalent
bb %>% 
  mutate(SUM = rowSums(across(where(is.numeric))),
         across(where(is.numeric), ~ round(. * 100 /SUM, 1))) 
## # A tibble: 3 x 4
##   x       `1`   `2`   SUM
##   <chr> <dbl> <dbl> <dbl>
## 1 A     100     0     100
## 2 B      84.6  15.4   100
## 3 C      16.7  83.3   100
#
# #Columnwise Percentage in Separate New Columns
bb %>% 
  mutate(across(where(is.numeric), ~ round(. * 100 /sum(.), 1), .names = "{.col}_Col" ))
## # A tibble: 3 x 5
##   x       `1`   `2` `1_Col` `2_Col`
##   <chr> <int> <int>   <dbl>   <dbl>
## 1 A         5     0    27.8     0  
## 2 B        11     2    61.1    16.7
## 3 C         2    10    11.1    83.3
# #Columnwise Percentage in Same Columns
bb %>% 
  mutate(across(where(is.numeric), ~ round(. * 100 /sum(.), 1)))
## # A tibble: 3 x 3
##   x       `1`   `2`
##   <chr> <dbl> <dbl>
## 1 A      27.8   0  
## 2 B      61.1  16.7
## 3 C      11.1  83.3

C02E28

Data

ex28 <- tibble(Observation = 1:20, 
        x = c(28, 17, 52, 79, 37, 71, 37, 27, 64, 53, 13, 84, 59, 17, 70, 47, 35, 62, 30, 43), 
        y = c(72, 99, 58, 34, 60, 22, 77, 85, 45, 47, 98, 21, 32, 81, 34, 64, 68, 67, 39, 28))

CrossTab

bb <- ex28
# #Round min down to the nearest 10 and max up to the nearest 10
nn <- 10L   
n_x <- seq(floor(min(bb$x) / nn) * nn, ceiling(max(bb$x) / nn) * nn, by = 20)
n_y <- seq(floor(min(bb$y) / nn) * nn, ceiling(max(bb$y) / nn) * nn, by = 20)
#
# #Labels in the format of [10-29]
lab_x <- paste0(n_x, "-", n_x + 20 - 1) %>% head(-1)
lab_y <- paste0(n_y, "-", n_y + 20 - 1) %>% head(-1)

# #Wider Table without Totals
ii <- bb %>% 
  mutate(x_bins = cut(x, breaks = n_x, right = FALSE, labels = lab_x),
         y_bins = cut(y, breaks = n_y, right = FALSE, labels = lab_y)) %>% 
  count(x_bins, y_bins) %>% 
  pivot_wider(names_from = y_bins, values_from = n, values_fill = 0, names_sort = TRUE)
print(ii)
## # A tibble: 4 x 5
##   x_bins `20-39` `40-59` `60-79` `80-99`
##   <fct>    <int>   <int>   <int>   <int>
## 1 10-29        0       0       1       4
## 2 30-49        2       0       4       0
## 3 50-69        1       3       1       0
## 4 70-89        4       0       0       0
# #Cross Tab with Total Column and Total Row
jj <- ii %>% 
  bind_rows(ii %>% summarise(across(where(is.numeric), sum))) %>% 
    mutate(across(1, fct_explicit_na, na_level = "Total")) %>% 
    mutate(SUM = rowSums(across(where(is.numeric))))
print(jj)
## # A tibble: 5 x 6
##   x_bins `20-39` `40-59` `60-79` `80-99`   SUM
##   <fct>    <int>   <int>   <int>   <int> <dbl>
## 1 10-29        0       0       1       4     5
## 2 30-49        2       0       4       0     6
## 3 50-69        1       3       1       0     5
## 4 70-89        4       0       0       0     4
## 5 Total        7       3       6       4    20

cut()

  • cut()
    • It slightly increases the range
    • ggplot2::cut_interval(), cut_width() do not increase the range
    • dig.lab : Options to exclude scientific notation
    • ordered_result : Option for ordered factor
# #Group Continuous Data to Categorical Bins by base::cut()
bb <- ex28
#
# #NOTE cut() increases the range slightly but ggplot functions do not
bb %>% mutate(x_bins = cut(x, breaks = 8)) %>% 
  pull(x_bins) %>% levels()
## [1] "(12.9,21.9]" "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84.1]"
# 
# #By default, it excludes the lower range, but it can be included by option
bb %>% mutate(x_bins = cut(x, breaks = 8, include.lowest = TRUE)) %>% 
  pull(x_bins) %>% levels()
## [1] "[12.9,21.9]" "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84.1]"
#
# #ggplot::cut_interval() makes n groups with equal range. There is a cut_number() also
bb %>% mutate(x_bins = cut_interval(x, n = 8)) %>% 
  pull(x_bins) %>% levels()
## [1] "[13,21.9]"   "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84]"
#
# #Specific Bins
bb %>% mutate(x_bins = cut(x, breaks = seq(10, 90, by = 10))) %>% 
  pull(x_bins) %>% levels()
## [1] "(10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
ii <- bb %>% mutate(x_bins = cut(x, breaks = seq(10, 90, by = 10), include.lowest = TRUE)) %>% 
  pull(x_bins) %>% levels()
print(ii)
## [1] "[10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
#
# #ggplot::cut_width() makes groups of width
bb %>% mutate(x_bins = cut_width(x, width = 10)) %>% 
  pull(x_bins) %>% levels()
## [1] "[5,15]"  "(15,25]" "(25,35]" "(35,45]" "(45,55]" "(55,65]" "(65,75]" "(75,85]"
#
# #Match cut_width() and cut()
jj <- bb %>% mutate(x_bins = cut_width(x, width = 10, boundary = 0)) %>% 
  pull(x_bins) %>% levels()
print(jj)
## [1] "[10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
stopifnot(identical(ii, jj))
#
# #Labelling
n_breaks <- seq(10, 90, by = 10)
n_labs <- paste0("*", n_breaks, "-", n_breaks + 10) %>% head(-1)

bb %>% mutate(x_bins = cut(x, breaks = n_breaks, include.lowest = TRUE, labels = n_labs)) %>% 
  pull(x_bins) %>% levels()
## [1] "*10-20" "*20-30" "*30-40" "*40-50" "*50-60" "*60-70" "*70-80" "*80-90"
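The `dig.lab` and `ordered_result` options listed in the bullets above can be demonstrated with a short sketch (hypothetical large values, chosen so the default labels switch to scientific notation):

```r
# #dig.lab controls the number of significant digits in the generated labels
x <- c(100000, 250000, 480000, 735000, 990000)
levels(cut(x, breaks = 3))                 # default dig.lab = 3 -> scientific notation
levels(cut(x, breaks = 3, dig.lab = 7))    # plain labels
#
# #ordered_result returns an ordered factor instead of a plain one
is.ordered(cut(x, breaks = 3))                         # FALSE
is.ordered(cut(x, breaks = 3, ordered_result = TRUE))  # TRUE
```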

24.6 Summarizing Data for Two Variables

  • Scatterplot and Trendline
  • Side by Side and Stacked Bar Charts

Data

xxCommercials <- tibble(Week = 1:10, 
                 Commercials = c(2, 5, 1, 3, 4, 1, 5, 3, 4, 2), 
                 Sales = c(50, 57, 41, 54, 54, 38, 63, 48, 59, 46))
f_setRDS(xxCommercials)

Figure 24.4 (C02P05) geom_point(), geom_smooth(), & stat_poly_eq()

Trendline

bb <- xxCommercials 

# #Formula for Trendline calculation
k_gg_formula <- y ~ x
#
# #Scatterplot, Trendline alongwith its equation and R2 value
C02P05 <- bb %>% {
  ggplot(data = ., aes(x = Commercials, y = Sales)) + 
  geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
  stat_poly_eq(aes(label = paste0("atop(", ..eq.label.., ", \n", ..rr.label.., ")")), 
               formula = k_gg_formula, eq.with.lhs = "italic(hat(y))~`=`~",
               eq.x.rhs = "~italic(x)", parse = TRUE) +
  geom_point() + 
  labs(x = "Commercials", y = "Sales ($100s)", 
       subtitle = paste0("Trendline equation and R", '\u00b2', " value"), 
       caption = "C02P05", title = "Scatter Plot")
}

Validation


25 Numerical Measures

25.1 Overview

25.2 Definitions (Ref)

23.22 The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.
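In R, var() and sd() compute the sample statistic (denominator n − 1); the population parameter divides by n. A sketch with hypothetical data (`f_var_pop` is a helper named here, following this document's `f_` prefix convention):

```r
# #Sample statistic vs population parameter for variance (hypothetical data)
x <- c(46, 54, 42, 46, 32)
n <- length(x)

var(x)                                          # sample variance, divides by n - 1 -> 64
f_var_pop <- function(x) mean((x - mean(x))^2)  # population variance, divides by n
f_var_pop(x)                                    # -> 51.2
#
# #The two are related by the factor (n - 1) / n
stopifnot(all.equal(f_var_pop(x), var(x) * (n - 1) / n))
```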

25.3 Number Theory

Definition 25.1 A number is a mathematical object used to count, measure, and label. Their study or usage is called arithmetic, a term which may also refer to number theory, the study of the properties of numbers.

Individual numbers can be represented by symbols, called numerals; for example, “5” is a numeral that represents the ‘number five.’

As only a relatively small number of symbols can be memorized, basic numerals are commonly organized in a numeral system, which is an organized way to represent any number. The most common numeral system is the Hindu-Arabic numeral system, which allows for the representation of any number using a combination of ten fundamental numeric symbols, called digits.

Counting is the process of determining the number of elements of a finite set of objects, i.e., determining the size of a set. Enumeration refers to uniquely identifying the elements of a set by assigning a number to each element.

Measurement is the quantification of attributes of an object or event, which can be used to compare with other objects or events.

Sets

Formally, \(\mathbb{N} \to \mathbb{Z} \to \mathbb{Q} \to \mathbb{R} \to \mathbb{C}\)
Practically, \(\mathbb{N} \subset \mathbb{Z} \subset \mathbb{Q} \subset \mathbb{R} \subset \mathbb{C}\)

The natural numbers \(\mathbb{N}\) are those numbers used for counting and ordering. The ISO standard begins the natural numbers with 0, corresponding to the non-negative integers \(\mathbb{N} = \{0, 1, 2, 3, \ldots \}\), whereas other conventions start with 1, corresponding to the positive integers \(\mathbb{N^*} = \{1, 2, 3, \ldots \}\)

The set of integers \(\mathbb{Z}\) consists of zero (\({0}\)), the positive natural numbers \(\{1, 2, 3, \ldots \}\) and their additive inverses (the negative integers). Thus, \(\mathbb{Z} = \{\ldots, -3, -2, -1, 0, 1, 2, 3, \ldots \}\). An integer is colloquially defined as a number that can be written without a fractional component.

Rational numbers \(\mathbb{Q}\) are those which can be expressed as the quotient or fraction p/q of two integers, a numerator p and a non-zero denominator q. Thus, Rational Numbers \(\mathbb{Q} = \{p/q \mid p, q \in \mathbb{Z}, q \neq 0 \}\)

A real number is a value of a continuous quantity that can represent a distance along a line. The real numbers include all the rational numbers \(\mathbb{Q}\), and all the irrational numbers. Thus, Real Numbers \(\mathbb{R} = \mathbb{Q} \cup \{\sqrt{2}, \sqrt{3}, \ldots\} \cup \{ \pi, e, \phi, \ldots \}\)

The complex numbers \(\mathbb{C}\) contain numbers which are expressed in the form \(a + ib\), where \({a}\) and \({b}\) are real numbers. A complex number has two components: a real part \({a}\) and an imaginary part \({b}\), where the specific element denoted by \({i}\) (the imaginary unit) satisfies the equation \(i^2 = -1\).
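R supports complex numbers natively; a minimal sketch (base R only) of the defining relation \(i^2 = -1\) and of extracting the two components:

```r
# #Complex arithmetic in base R: the imaginary unit satisfies i^2 = -1
z <- complex(real = 0, imaginary = 1)
print(z^2)
# #Real and imaginary components of a + ib
Re(3 + 2i)
Im(3 + 2i)
```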

Pi

The number Pi \(\pi = 3.14159\ldots\) is defined as the ratio of the circumference of a circle to its diameter.

\[\pi = \int _{-1}^{1} \frac{dx}{\sqrt{1- x^2}} \tag{25.1}\]

\[e^{i\varphi}=\cos \varphi + i\sin \varphi \tag{25.2}\]

\[e^{i\pi} + 1 = 0 \tag{25.3}\]
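Equation (25.1) can be verified numerically with base R's integrate(); a minimal check:

```r
# #Numerical check of equation (25.1): integral of 1/sqrt(1 - x^2) over [-1, 1]
ii <- integrate(function(x) 1 / sqrt(1 - x^2), lower = -1, upper = 1)
# #The value should agree with pi within a small tolerance
abs(ii$value - pi) < 1e-6
```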

# #Read OEIS file for 20000 digits of Pi, including the integer part (3) and fractional part (14159...)
# #md5sum = "daf0b33a67fd842a905bb577957a9c7f"
tbl <- read_delim(file = paste0(.z$XL, "PI-OIS-b000796.txt"), 
  delim = " ", col_names = c("POS", "VAL"), col_types = list(POS = "i", VAL = "i"))
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxPI <- tbl
f_setRDS(xxPI)

e

Euler Number \(e = 2.71828\ldots\), is the base of the natural logarithm.

\[e = \lim_{n \to \infty} \left(1 + \frac{1}{n} \right)^{n} = \sum \limits_{n=0}^{\infty} \frac{1}{n!} \tag{25.4}\]
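Both definitions in equation (25.4) are easy to check in base R:

```r
# #Limit definition: (1 + 1/n)^n approaches e as n grows
n <- 1e6
(1 + 1/n)^n
# #Series definition: the sum of 1/n! converges to e very quickly
sum(1 / factorial(0:20))
exp(1)
```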

Phi

Two quantities are in the golden ratio \(\varphi = 1.618\ldots\) if their ratio is the same as the ratio of their sum to the larger of the two quantities.

\[\varphi^2 - \varphi - 1 = 0 \\ \varphi = \frac{1 + \sqrt{5}}{2} \tag{25.5}\]
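A one-line check of equation (25.5) in R:

```r
# #Golden ratio: the positive root of phi^2 - phi - 1 = 0
phi <- (1 + sqrt(5)) / 2
# #Should be zero up to floating-point error
phi^2 - phi - 1
```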

Groups

Definition 25.2 A prime number is a natural number greater than 1 that is not a product of two smaller natural numbers. A natural number greater than 1 that is not prime is called a ‘composite number.’ 1 is neither a Prime nor a composite; it is a ‘Unit.’ Thus, by definition, Negative Integers and Zero cannot be Prime.
Definition 25.3 Parity is the property of an integer \(\mathbb{Z}\) of whether it is even or odd. It is even if the integer is divisible by 2 with no remainders left and it is odd otherwise. Thus, -2, 0, +2 are even but -1, 1 are odd. Numbers ending with 0, 2, 4, 6, 8 are even. Numbers ending with 1, 3, 5, 7, 9 are odd.
Definition 25.4 An integer \(\mathbb{Z}\) is positive if it is greater than zero, and negative if it is less than zero. Zero is defined as neither negative nor positive.
Definition 25.5 Mersenne primes are those prime numbers that are of the form \((2^n -1)\); that is, \(\{3, 7, 31, 127, \ldots \}\)

Mersenne primes:

  • \(\{3, 7, 31, 127, 8191, 131071, 524287, 2147483647, 2305843009213693951, \ldots \}\)
  • \(\{3 (2^{nd}), 7(4^{th}), 31(11^{th}), 127(31^{st}), 8191 (1028^{th}), 131071 (12,251^{st}), 524287 (43,390^{th}), \ldots \}\)
    • Mersenne primes with their position in List of Primes
  • \(2147483647 = (2^{31} − 1)\)
    • It is 105,097,565\(^{th}\) Prime, \(8^{th}\) Mersenne prime and is one of only four known double Mersenne primes.
    • It represents the largest value that a signed 32-bit integer field can hold.
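The form \((2^n - 1)\) can be checked directly in R; note that the exponents yielding Mersenne primes (2, 3, 5, 7, 13, ...) are themselves prime:

```r
# #First four Mersenne primes from their (prime) exponents
nn <- c(2, 3, 5, 7)
mm <- 2^nn - 1
print(mm)
# #The largest signed 32-bit integer is the 8th Mersenne prime
2^31 - 1 == 2147483647
```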

25.4 Primes

Empty Vector

# #Create empty Vector with NA
aa <- rep(NA_integer_, 10)
print(aa)
##  [1] NA NA NA NA NA NA NA NA NA NA

f_isPrime()

f_isPrime <- function(x) {
  # #Check if the number is Prime
  if(!is.integer(x)) {
    cat("Error! Integer required. \n")
    stop()
  } else if(x <= 0L) {
    cat("Error! Positive Integer required. \n")
    stop()
  } else if(x > 2147483647L) {
    cat(paste0("Doubles are stored as approximations. Primality will not be checked for values higher than '2147483647' \n"))
    stop()
  }
  # #However, this checks the number against ALL values up to sqrt(x), including non-primes
  if(x == 2L || all(x %% 2L:ceiling(sqrt(x)) != 0)) {
    # # "seq.int(3, ceiling(sqrt(x)), 2)" is slower
    return(TRUE)
  } else {
    ## (any(x %% 2L:ceiling(sqrt(x)) == 0))
    ## (any(x %% seq.int(3, ceiling(sqrt(x)), 2) == 0))
    ## NOTE Further, if sequence starts from 3, add 2 also as a Prime Number
    return(FALSE)
  }
}
# #Vectorise Version
f_isPrimeV <- Vectorize(f_isPrime)
# #Compiled Version
f_isPrimeC <- cmpfun(f_isPrime)

Primes

# #There are 4 Primes in First 10, 25 in 100, 168 in 1000, 1229 in 10000.
# # Using Vectorise Version, get all the Primes
aa <- 1:10
bb <- aa[f_isPrimeV(aa)]
ii <- f_getPrimeUpto(10)
stopifnot(identical(bb, ii))
# #
xxPrime10 <- c(2, 3, 5, 7) |> as.integer()
# #
xxPrime100 <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 
               53, 59, 61, 67, 71, 73, 79, 83, 89, 97)  |> as.integer()
#
# #Generate List of ALL Primes till 524287 (i.e. Total 43,390 Primes)
xxPrimes <- f_getPrimeUpto(524287L)
# #Save as RDS
f_setRDS(xxPrimes)

Large Integers

# #NOTE: Assigning 2147483647L causes the Chunk to throw Warnings even with 'eval=FALSE'.
if(FALSE){
# #Assignment of 2305843009213693951L is NOT possible without Warning
# #Even within non-executing Block or with 'eval=FALSE' or suppressWarnings() or tryCatch()
# #It cannot be stored as integer, thus it is automatically converted to double
  #bb <- 2305843009213693951L
# #Warning: non-integer value 2305843009213693951L qualified with L; using numeric value 
# #NOTE that the value changed. It is explicitly NOT a prime anymore.
  #print(aa, digits = 20)
# #[1] 2305843009213693952
#
# #Assignment of 2147483647L is possible and direct printing in console works BUT
# #Its printing will also throw Warnings that are difficult to handle
# #Avoid Printing. Even within non-executing Block, it is affecting R Bookdown.
  aa <- 2147483647L
  #print(aa)
}

f_getPrime()

f_getPrimeUpto <- function(x){
  # #Get a Vector of Primes upto the given Number (Max. 524287)
  if(x < 2) {
    print("NOT ALLOWED!")
    return(NULL)
  } else if(x > 524287){
    print("Sadly, beyond this number it is difficult to generate the List of Primes!")
    return(NULL)
  }
  y <- 2:x
  i <- 1
  while (y[i] <= sqrt(x)) {
    y <-  y[y %% y[i] != 0 | y == y[i]]
    i <- i+1
  }
  return(y)
}

Benchmark

# #Compare any number of functions
result <- microbenchmark(
  sum(1:100)/length(1:100), 
  mean(1:100),
  #times = 1000,
  check = 'identical'
)
# #Print Table
print(result)
##Unit: microseconds
##                     expr min    lq    mean median     uq    max neval cld
## sum(1:100)/length(1:100) 1.2 1.301 1.54795 1.5005 1.6005  7.501   100  a 
##              mean(1:100) 5.9 6.001 6.56989 6.1010 6.2010 28.001   100   b
#
# #Boxplot of Benchmarking Result
#autoplot(result)
# #Above testcase showed a surprising result of sum()/length() being much faster than mean()
#
# #Or Compare Plot Rendering
if(FALSE) microbenchmark(print(jj), print(kk), print(ll), times = 2)

Sum-Mean

“ForLater” - Include rowsum(), rowSums(), colSums(), rowMeans(), colMeans() in this also.

# #Conclusion: use mean() because precision is difficult to achieve compared to speed
#
# #sum()/length() is faster than mean()
# #However, mean() does double pass, so it would be more accurate
# #mean.default() and var() compute means with an additional pass and so are more accurate
# #e.g. the variance of a constant vector is (almost) always zero 
# #and the mean of such a vector will be equal to the constant value to machine precision.
aa <- 1:100
#
microbenchmark(
  sum(aa)/length(aa), 
  mean(aa),
  mean.default(aa),
  .Internal(mean(aa)),
  #times = 1000,
  check = 'identical'
)
## Unit: nanoseconds
##                 expr  min   lq mean median   uq   max neval cld
##   sum(aa)/length(aa)  400  400  508    500  500  2000   100 a  
##             mean(aa) 3900 4000 4320   4100 4200 19400   100   c
##     mean.default(aa) 1400 1500 1619   1500 1600  5400   100  b 
##  .Internal(mean(aa))  500  500  580    550  600  3300   100 a
# #rnorm() generates random deviates of given length
set.seed(3)
aa <- rnorm(1e7)
str(aa)
##  num [1:10000000] -0.962 -0.293 0.259 -1.152 0.196 ...
#
# #NOTE manual calculation and mean() do NOT match
identical(sum(aa)/length(aa), mean(aa))
## [1] FALSE
#
# #There is a slight difference
sum(aa)/length(aa) - mean(aa)
## [1] 0.00000000000000002355429

Remove Objects

if(FALSE) {
  # #Remove all objects matching a pattern
  rm(list = ls(pattern = "f_"))
}

Options Memory

# #Check the Current Options Value
getOption("expressions")
## [1] 5000
if(FALSE) {
  # #Change Value
  # #NOTE it did not help when recursive function failed
  # #Error: node stack overflow
  # #Error during wrapup: node stack overflow
  # #Error: no more error handlers available ...
  options(expressions=10000)
}

Vectorize()

# #To Vectorise a Function
f_isPrimeV <- Vectorize(f_isPrime)

Compiling

# #To Pre-Compile a Function for faster performance
f_isPrimeC <- cmpfun(f_isPrime)

Profiling

# #To Profile a Function Calls for improvements
Rprof("file.out")
f_isPrime(2147483647L)
#f_getPrimeUpto(131071L)
Rprof(NULL)
summaryRprof("file.out")

Legacy A

# #Functions to check for PRIME - All of them have various problems
# #"-3L -2L -1L 0L 1L 8L" FALSE "2L 3L ... 524287L 2147483647L" TRUE
isPrime_a <- function(x) {
  # #Fails for "2147483647L" Error: cannot allocate vector of size 8.0 Gb
  if (x == 2L) {
    return(TRUE)
  } else if (any(x %% 2:(x-1) == 0)) {
    return(FALSE)
  } else return(TRUE)
}

isPrime_b <- function(x){
  # #Comparison of Division and Integer Division by 1, 2, ..., x
  # #Fails for "2147483647L" Error: cannot allocate vector of size 16.0 Gb
  # #Fails for "-ve and zero" Error: missing value where TRUE/FALSE needed
  # vapply(x, function(y) sum(y / 1:y == y %/% 1:y), integer(1L)) == 2L
  if(sum(x / 1:x == x %/% 1:x) == 2) {
    return(TRUE) 
  } else return(FALSE)
}

isPrime_c <- function(x) {
  # #RegEx is slowest: it converts -ve values and coerces non-integers, which may result in bugs
  x <- abs(as.integer(x))
  if(x > 8191L) {
    print("Do not run this with large values. RegEx is really slow.")
    stop()
  }
  !grepl('^1?$|^(11+?)\\1+$', strrep('1', x))
}

isPrime_d <- function(x) {
  # #Fails for "1" & returns TRUE
  # #Fails for "-ve and zero" Error: NA/NaN argument
  if(x == 2L || all(x %% 2L:max(2, floor(sqrt(x))) != 0)) {
    return(TRUE)
  } else return(FALSE)
}

isPrime_e <- function(x) {
  # #Fails for "-ve and zero" Error: NA/NaN argument
  # #This is the most robust which can be improved by conditional check for positive integers
  # #However, this checks the number against ALL Smaller values including non-primes
  if(x == 2L || all(x %% 2L:ceiling(sqrt(x)) != 0)) {
    # # "seq.int(3, ceiling(sqrt(x)), 2)" is slower
    return(TRUE)
  } else {
    ## (any(x %% 2L:ceiling(sqrt(x)) == 0))
    ## (any(x %% seq.int(3, ceiling(sqrt(x)), 2) == 0))
    ## NOTE Further, if sequence starts from 3, add 2 also as a Prime Number
    return(FALSE)
  }
}

Legacy B

# #131071 (12,251th), 524287 (43,390th), 2147483647 (105,097,565th)
aa <- 1:131071
# #Following works but only till 524287L, Memory Overflow ERROR for 2147483647L
bb <- aa[f_isPrimeV(aa)]

getPrimeUpto_a <- function(x){
  # #Extremely slow, cannot go beyond 8191L in benchmark testing
  if(x < 2) return("ERROR")
  y <- 2:x
  primes <- rep(2L, x)
  j <- 1L
  for (i in y) {
    if (!any(i %% primes == 0)) {
      j <- j + 1L
      primes[j] <- i
      #cat(paste0("i=", i, ", j=", j, ", Primes= ", paste0(head(primes, j), collapse = ", ")))
    }
    #cat("\n")
  }
  result <- head(primes, j)
  #str(result)
  #cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
  #cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
  return(result)
}

getPrimeUpto_b <- function(x){
# #https://stackoverflow.com/questions/3789968/
  # #This is much faster even from the "aa[f_isPrimeV(aa)]"
    if(x < 2) return("ERROR")
    y <- 2:x
    i <- 1
    while (y[i] <= sqrt(x)) {
        y <-  y[y %% y[i] != 0 | y == y[i]]
        i <- i+1
    }
    result <- y
    #str(result)
    #cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
    #cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
    return(result)
}

getPrimeUpto_c <- function(x) {
  # #Problems and Slow
  # #Returns a Vector of Primes till the Number i.e. f_getPrimesUpto(7) = (2, 3, 5, 7)
  # #NOTE: f_getPrimesUpto(1) and f_getPrimesUpto(2) both return "2"
  if(!is.integer(x)) {
    cat("Error! Integer required. \n")
    stop()
  } else if(!identical(1L, length(x))) {
    cat("Error! Unit length vector required. \n")
    stop()
  } else if(x <= 0L) {
    cat("Error! Positive Integer required. \n")
    stop()
  } else if(x > 2147483647) {
    cat(paste0("Doubles are stored as approximations. Primality will not be checked for values higher than '2147483647' \n"))
    stop()
  }
  
  # #Cannot create vector of length 2147483647L and also not needed that many
  # #ceiling(sqrt(7L)) return 3, however we need length 4 (2, 3, 5, 7)
  # #So, added extra 10
  #primes <- rep(NA_integer_, 10L + sqrt(2L))
  primes <- rep(2L, 10L + sqrt(2L))
  j <- 1L
  primes[j] <- 2L
  #
  i <- 2L
  while(i <= x) {
    # #na.omit() was the slowest step, so changed all NA to 2L in the primes
    #k <- na.omit(primes[primes <= ceiling(sqrt(i))])
    k <- primes[primes <= ceiling(sqrt(i))]
    if(all(as.logical(i %% k))) {
      j <- j + 1
      primes[j] <- i
    }  
    # #Increment with INTEGER Addition
    i = i + 1L
  }
  result <- primes[complete.cases(primes)]
  str(result)
  cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
  cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
  return(result)
}

getPrimeUpto_d <- function(n = 10L, i = 2L, primes = c(2L), bypass = TRUE){
  # #Using Recursion is NOT a good solution
  # #Function to return N Primes upto 1000 Primes (7919) or Max Value reaching 10000.
  if(i > 10000){
    cat("Reached 10000 \n")
    return(primes)
  }
  if(bypass) {
    maxN <- 1000L
    if(!is.integer(n)) {
      cat("Error! Integer required. \n")
      stop()
    } else if(!identical(1L, length(n))) {
      cat("Error! Unit length vector required. \n")
      stop()
    } else if(n <= 0L) {
      cat("Error! Positive Integer required. \n")
      stop()
    } else if(n > maxN) {
      cat(paste0("Error! This will calculate only upto ", maxN, " prime Numbers. \n"))
      stop()
    }
  }
  if(length(primes) < n) {
    if(all(as.logical(i %% primes[primes <= ceiling(sqrt(i))]))) {
      # #Coercing 0 to FALSE, Non-zero Values to TRUE
      # # "i %% 2L:ceiling(sqrt(i))" checks i against all integers till sqrt(i)
      # # "primes[primes <= ceiling(sqrt(i))]" checks i against only the primes till sqrt(i)
      # #However, the above needs hardcoded 2L as prime so the vector is never empty
      # #Current Number is Prime, so include it in the vector and check the successive one
      getPrimeUpto_d(n, i = i+1, primes = c(primes, i), bypass = FALSE)
    } else {
      # #Current Number is NOT Prime, so check the successive one
      getPrimeUpto_d(n, i = i+1, primes = primes, bypass = FALSE)
    }
  } else {
    # #Return the vector when it reaches the count
    return(primes)
  }
}

25.5 Measures of Location

Definition 25.6 Measures of location are numerical summaries that indicate where on a number line a certain characteristic of the variable lies. Examples of the measures of location are percentiles and quantiles.
Definition 25.7 The measures of center are a special case of measures of location. These estimate where the center of a particular variable lies. Most common are Mean, Median, and Mode.

25.5.1 Mean

Definition 25.8 Given a data set \({X = \{{x}_1, {x}_2, \ldots, {x}_n\}}\), the mean \({\overline{x}}\) is the sum of all of the values \({{x}_1, {x}_2, \ldots, {x}_n}\) divided by the count \({n}\).
  • Refer equation (25.6)
    • Sample mean is denoted by \({\overline{x}}\) (x bar) and Population mean is denoted by \({\mu}\).
    • Mean is the most commonly used measure of central location, even though it is influenced by extreme values.

\[\overline{x} = \frac{1}{n}\left (\sum_{i = 1}^n{{x}_i}\right ) = \frac{{x}_1 + {x}_2 + \cdots + {x}_n}{n} \tag{25.6}\]

In the mean calculation, normally each \({{x}_i}\) is given equal importance or weightage of \({1/n}\). However, in some instances the mean is computed by giving each observation a weight that reflects its relative importance. A mean computed in this manner is referred to as the weighted mean, as given in equation (25.7)

\[\overline{x} = \frac{\sum_{i=1}^n{w_ix_i}}{\sum_{i=1}^n{w_i}} \tag{25.7}\]

Caution: The unit of the mean is the same as the unit of the variable, e.g. for cost_per_kg the weight ‘w’ would be in ‘kg.’

Mean

aa <- 1:10
# #Mean of First 10 Numbers
mean(aa)
## [1] 5.5

More

aa <- 1:10
# #Mean of First 10 Numbers
ii <- mean(aa)
print(ii)
## [1] 5.5
jj <- sum(aa)/length(aa)
stopifnot(identical(ii, jj))
#
# #Mean of First 10 Prime Numbers (is neither Prime nor Integer)
mean(f_getRDS(xxPrimes)[1:10])
## [1] 12.9
#
# #Mean of First 100 Digits of PI
f_getRDS(xxPI)[1:100, ] %>% pull(VAL) %>% mean()
## [1] 4.71

Weighted Mean

aa <- tibble(Purchase = 1:5, cost_per_kg = c(3, 3.4, 2.8, 2.9, 3.25), 
             kg = c(1200, 500, 2750, 1000, 800))
# #NOTE that unit of mean is same as unit of the variable e.g. cost_per_kg thus 'w' would be 'kg'
(ii <- sum(aa$cost_per_kg * aa$kg)/sum(aa$kg))
## [1] 2.96
jj <- with(aa, sum(cost_per_kg * kg)/sum(kg))
kk <- weighted.mean(x = aa$cost_per_kg, w = aa$kg)
stopifnot(all(identical(ii, jj), identical(ii, kk)))

25.5.2 Median

Definition 25.9 Median of a population is any value such that at most half of the population is less than the proposed median and at most half is greater than the proposed median.
  • Refer equation (25.8)
    • The median is the value in the middle when the data is sorted
    • For an odd number of observations, the median is the middle value.
    • For an even number of observations, the median is the average of the two middle values.
    • Although the mean is the more commonly used measure of central location, whenever a data set contains extreme values, the median is preferred.
      • The mean and median are different concepts and answer different questions.
        • Ex: Income - nearly always reported as median, but if we are looking at the ‘spending power of the whole community,’ it may not be right.
    • The median is well-defined for any ordered data, and is independent of any distance metric.
      • The median can thus be applied to classes which are ranked but not numerical (ordinal), although the result might be halfway between classes if there is an even number of cases.

\[\begin{align} \text{if n is odd, } median(x) & = x_{(n + 1)/2} \\ \text{if n is even, } median(x) & = \frac{x_{(n/2)} + x_{(n/2) + 1}}{2} \end{align} \tag{25.8}\]

Median

aa <- 1:10 
# #Median of First 10 Numbers
median(aa)
## [1] 5.5

More

aa <- 1:10 
# #Median of First 10 Numbers
median(aa)
## [1] 5.5
#
# #Median of First 10 Prime Numbers (is NOT prime)
median(f_getRDS(xxPrimes)[1:10])
## [1] 12
#
# #Median of First 100 Digits of PI
f_getRDS(xxPI)[1:100, ] %>% pull(VAL) %>% median()
## [1] 4.5

25.5.3 Geometric Mean

Definition 25.10 The geometric mean \(\overline{x}_g\) is a measure of location that is calculated by finding the \(n^{th}\) root of the product of \({n}\) values.
  • Refer equation (25.9)
    • The geometric mean applies only to positive numbers
    • The geometric mean is often used for a set of numbers whose values are meant to be multiplied together or are exponential in nature
    • For all positive data sets containing at least one pair of unequal values, the harmonic mean is always the least of the three means, while the arithmetic mean is always the greatest of the three and the geometric mean is always in between.

\[\overline{x}_g = \left(\prod _{i=1}^{n} x_i\right)^{\frac{1}{n}} = \sqrt[{n}]{x_1 x_2 \ldots x_n} \tag{25.9}\]
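The ordering harmonic ≤ geometric ≤ arithmetic noted above can be checked with a small vector of unequal positive values:

```r
# #Harmonic <= Geometric <= Arithmetic mean for unequal positive values
aa <- c(1, 2, 4, 8)
am <- mean(aa)
gm <- exp(mean(log(aa)))
hm <- 1 / mean(1 / aa)
c(HM = hm, GM = gm, AM = am)
```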

Geometric Mean

aa <- 1:10
# #Geometric Mean of First 10 Numbers
exp(mean(log(aa)))
## [1] 4.528729

More

aa <- 1:10
# #Geometric Mean of First 10 Numbers
ii <- exp(mean(log(aa)))
jj <- prod(aa)^(1/length(aa))
stopifnot(identical(ii, jj))
#
# #Geometric Mean of First 10 Prime Numbers 
exp(mean(log(f_getRDS(xxPrimes)[1:10])))
## [1] 9.573889

25.5.4 Mode

Definition 25.11 The mode is the value that occurs with greatest frequency.
  • The median makes sense when there is a linear order on the possible values. Unlike median, the concept of mode makes sense for any random variable assuming values from a vector space.

Mode

# #Mode of First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ] %>% pull(VAL)
f_getMode(bb)
## [1] 9

More

# #Mode of First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #Get Frequency
bb %>% count(VAL)
## # A tibble: 10 x 2
##      VAL     n
##    <int> <int>
##  1     0     8
##  2     1     8
##  3     2    12
##  4     3    12
##  5     4    10
##  6     5     8
##  7     6     9
##  8     7     8
##  9     8    12
## 10     9    13
#
# #Get Mode
bb %>% pull(VAL) %>% f_getMode()
## [1] 9

f_getMode()

f_getMode <- function(x) {
  # #Calculate Statistical Mode
  # #NOTE: Single Length, All NA, Characters etc. have NOT been validated
  # #https://stackoverflow.com/questions/56552709
  # #https://stackoverflow.com/questions/2547402
  # #Remove NA
  if (anyNA(x)) {
    x <- x[!is.na(x)]
  }
  # #Get Unique Values
  ux <- unique(x)
  # #Match
  ux[which.max(tabulate(match(x, ux)))]
}

25.5.5 Percentiles

Definition 25.12 A percentile provides information about how the data are spread over the interval from the smallest value to the largest value. For a data set containing \({n}\) observations, the \(p^{th}\) percentile divides the data into two parts: approximately p% of the observations are less than the \(p^{th}\) percentile, and approximately (100 - p)% of the observations are greater than the \(p^{th}\) percentile.
  • Refer equation (25.10)
    • Percentile is the value which divides the data into two groups when it is sorted
    • Quartiles are specific percentiles of 25%, 50% and 75%
    • Median is 50% percentile
    • Caution: Excel “PERCENTILE.EXC” calculations match the type = 6 option of quantile(); the default is type = 7

\[L_p = \frac{p}{100}(n + 1) \tag{25.10}\]
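The location formula in equation (25.10) corresponds to quantile() with type = 6; a minimal sketch on 1:10:

```r
# #Location L_p from equation (25.10) for p = 50 on n = 10 sorted values
aa <- 1:10
p <- 50
Lp <- (p / 100) * (length(aa) + 1)   # position 5.5, halfway between x_5 and x_6
ii <- quantile(aa, probs = p / 100, type = 6)
print(ii)
```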

Percentiles

# #First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #50% Percentile of Digits i.e. Median
quantile(bb$VAL, 0.5)
## 50% 
## 4.5

More

# #First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #50% Percentile of Digits i.e. Median
ii <- quantile(bb$VAL, 0.5)
print(ii)
## 50% 
## 4.5
jj <- median(bb$VAL)
stopifnot(identical(unname(ii), jj))
# 
# #All Quartiles
quantile(bb$VAL, seq(0, 1, 0.25))
##   0%  25%  50%  75% 100% 
## 0.00 2.00 4.50 7.25 9.00
# #summary()
summary(bb$VAL)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    2.00    4.50    4.71    7.25    9.00
#
# #To Match with Excel "PERCENTILE.EXC" use type=6 in place of default type=7
quantile(bb$VAL, seq(0, 1, 0.25), type = 6)
##   0%  25%  50%  75% 100% 
## 0.00 2.00 4.50 7.75 9.00

25.6 Measures of Variability

Definition 25.13 Measures of spread (or the measures of variability) describe how spread out the data values are. Examples are Range, SD, mean absolute deviation, and IQR

In addition to measures of location, it is often desirable to consider measures of variability, or dispersion.

  • Range range()
    • (Largest value - Smallest value) i.e. max() - min()
    • Range is based on only two of the observations and thus is highly influenced by extreme values.
  • Interquartile Range (IQR) IQR()
    • The difference between the third quartile \((Q3, 75\%)\), and the first quartile \((Q1, 25\%)\)
    • IQR is a measure of variability, much more robust than the SD. IQR is less sensitive to the presence of the outliers.
    • It overcomes the dependency on extreme values
    • \(x_i \notin [Q_1 - 1.5 * \text{IQR}, Q_3 + 1.5 * \text{IQR}] \to x_i \in \text{Outlier}\)
    • It is assumed that any data point not in [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] is an outlier
      • The 1.5 multiplier gives us a window of \({\mu} \pm 2.7 {\sigma} \approx {\mu} \pm 3 {\sigma}\), which is quite close to the Normal Plot limit.
  • Mean Absolute Deviation (MAD)
    • \(\text{MAD} = \frac{\sum |x_i - \overline{x}|}{n}\)
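Quick computations of these spread measures on the first 10 numbers, using only base R:

```r
aa <- 1:10
# #Range: max - min
diff(range(aa))
# #Interquartile range (default type = 7 quantiles)
IQR(aa)
# #Mean absolute deviation about the mean
mean(abs(aa - mean(aa)))
```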

25.6.1 Variance

Definition 25.14 The variance \(({\sigma}^2)\) is based on the difference between the value of each observation \({x_i}\) and the mean \({\overline{x}}\). The average of the squared deviations is called the variance.
  • Refer equation (25.11)
    • Sample Variance is denoted by \(s^2\) and Population Variance is denoted by \(\sigma^2\)
    • The variance is a measure of variability that utilizes all the data.
    • The difference between each \({x_i}\) and the mean (\(\overline{x}, \mu\)) is called a deviation about the mean i.e. (\(x_i - \overline{x}\)). Sum of deviation about the mean is always zero i.e. \(\sum (x_i - \overline{x}) =0\)
    • In the computation of the variance, the deviations about the mean are squared.
      • Because of the squaring involved, it is sensitive to the presence of outliers, leading analysts to prefer other measures of spread, such as the mean absolute deviation, in situations involving extreme values.

\[\begin{align} \sigma^2 &= \frac{1}{n} \sum _{i=1}^{n} \left(x_i - \mu \right)^2 \\ s^2 &= \frac{1}{n-1} \sum _{i=1}^{n} \left(x_i - \overline{x} \right)^2 \end{align} \tag{25.11}\]
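In R, var() computes the sample variance, i.e. the \(n - 1\) denominator form of equation (25.11):

```r
aa <- 1:10
# #Sample variance via var() vs. manual computation with (n - 1) denominator
ii <- var(aa)
jj <- sum((aa - mean(aa))^2) / (length(aa) - 1)
print(ii)
```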

25.6.2 Standard Deviation

Definition 25.15 The standard deviation (\(s, \sigma\)) is defined to be the positive square root of the variance. It is a measure of the amount of variation or dispersion of a set of values.
  • Refer equation (25.12)
    • Standard deviation for sample is denoted by \({s}\) and for Population by \({\sigma}\)
    • It can be interpreted as the “typical” distance between a field value and the mean
    • The coefficient of variation is a relative measure of variability. It measures the standard deviation relative to the mean. It is given in percentage as \(100 \times \sigma / \mu\)

\[\begin{align} \sigma &= \sqrt{\frac{1}{N} \sum_{i=1}^N \left(x_i - \mu\right)^2} \\ {s} &= \sqrt{\frac{1}{N-1} \sum_{i=1}^N \left(x_i - \overline{x}\right)^2} \end{align} \tag{25.12}\]
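sd() is the square root of var(), and the coefficient of variation follows directly:

```r
aa <- 1:10
# #Sample standard deviation and its relation to the sample variance
ii <- sd(aa)
# #Coefficient of variation, in percent: 100 * sd / mean
100 * sd(aa) / mean(aa)
```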

25.7 Measures of Distribution Shape

25.7.1 Skewness

Definition 25.16 Skewness \((\tilde{\mu}_{3})\) is a measure of the shape of a data distribution. It is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
Definition 25.17 A tail refers to the tapering sides at either end of a distribution curve.
  • Data skewed to the left result in negative skewness; a symmetric data distribution results in zero skewness; and data skewed to the right result in positive skewness.
  • \(\tilde{\mu}_{3}\) is the \(3^{rd}\) standardized moment
    • Side topic: A standardized moment of a probability distribution is a moment (normally a higher degree central moment) that is normalized. The normalization is typically a division by an expression of the standard deviation which renders the moment scale invariant.
      • \(\tilde{\mu}_{1} = 0\), because the first moment about the mean is always zero.
      • \(\tilde{\mu}_{2} = 1\), because the second moment about the mean is equal to the variance \({\sigma}^2\).
      • \(\tilde{\mu}_{3}\) is a measure of skewness
      • \(\tilde{\mu}_{4}\) refers to the Kurtosis

Refer figure 25.1

  • No skew: (symmetric)
    • A unimodal distribution with zero skewness is not necessarily symmetric. However, a symmetric unimodal or multimodal distribution always has zero skewness. The normal distribution has a skewness of zero, but the reverse need not be true.
  • Negative skew: (left-skewed, left-tailed, or skewed to the left)
    • The left tail is longer, thus the ‘left’ refers to the left tail being drawn out
    • The curve itself appears to be leaning to the right i.e. the mass of the distribution is concentrated on the right of the figure
  • Positive skew: (right-skewed, right-tailed, or skewed to the right)
    • The right tail is longer, thus the ‘right’ refers to the right tail being drawn out
    • The curve itself appears to be leaning to the left i.e. the mass of the distribution is concentrated on the left of the figure
  • Relationship of mean and median
    • The skewness is not directly related to the relationship between the mean and median: a distribution with negative skew can have its mean greater than or less than the median, and likewise for positive skew
    • However, generally the skew can be calculated as \(({\mu} -{\nu})/\sigma\), where \({\nu}\) is median
  • Application:
    • Skewness indicates the direction and relative magnitude of deviation from the normal distribution.
    • It indicates the direction of outliers
    • With pronounced skewness, standard statistical inference procedures such as a confidence interval for a mean will be not only incorrect, in the sense that the true coverage level will differ from the nominal (e.g., 95%) level, but they will also result in unequal error probabilities on each side.

Skewness is given by equation (25.13).

\[Skew = \frac{\tfrac {1}{n}\sum_{i=1}^{n}(x_{i}-{\overline{x}})^{3}}{\left[\tfrac {1}{n-1}\sum_{i=1}^{n}(x_{i}-{\overline{x}})^{2} \right]^{3/2}} \tag{25.13}\]

Charts


Figure 25.1 (C03P06 C03P04 C03P05) (Left Tail, Negative) Beta, Normal Distribution, Exponential (Positive, Right Tail)

skewness()

# #Skewness Calculation: Package "e1071" (Package "moments" deprecated)
even_skew <- c(49, 50, 51)
pos_skew <- c(even_skew, 60)
neg_skew <- c(even_skew, 40)
skew_lst <- list(even_skew, pos_skew, neg_skew)
# #Mean, Median, SD
cat(paste0("Mean (even, pos, neg): ", 
           paste0(vapply(skew_lst, mean, numeric(1)), collapse = ", "), "\n"))
## Mean (even, pos, neg): 50, 52.5, 47.5
cat(paste0("Median (even, pos, neg): ", 
           paste0(vapply(skew_lst, median, numeric(1)), collapse = ", "), "\n"))
## Median (even, pos, neg): 50, 50.5, 49.5
cat(paste0("SD (even, pos, neg): ", paste0(
           round(vapply(skew_lst, sd, numeric(1)), 1), collapse = ", "), "\n"))
## SD (even, pos, neg): 1, 5.1, 5.1
#
cat(paste0("Skewness (even, pos, neg): ", paste0(
           round(vapply(skew_lst, e1071::skewness, numeric(1)), 1), collapse = ", "), "\n"))
## Skewness (even, pos, neg): 0, 0.7, -0.7
cat(paste0("Kurtosis (even, pos, neg): ", paste0(
           round(vapply(skew_lst, e1071::kurtosis, numeric(1)), 1), collapse = ", "), "\n"))
## Kurtosis (even, pos, neg): -2.3, -1.7, -1.7

Normal Exp Beta

# #Skewness Calculation: Package "e1071" (Package "moments" deprecated)
dis_lst <- list(xxNormal, xxExp, xxBeta)
#
# #Skewness: Normal has theoretical Skewness of 0 (symmetric)
# #Skewness "e1071" has Type = 3 as default. Its Type = 1 matches "moments"
# #Practically, a Normal sample has (small) NON-Zero Skewness
skew_e_t3 <- vapply(dis_lst, e1071::skewness, numeric(1))
skew_e_t2 <- vapply(dis_lst, e1071::skewness, type = 2, numeric(1))
skew_e_t1 <- vapply(dis_lst, e1071::skewness, type = 1, numeric(1))
skew_mmt <-  vapply(dis_lst, moments::skewness, numeric(1))
stopifnot(identical(round(skew_e_t1, 10), round(skew_mmt, 10)))
cat(paste0("e1071: Type = 1 Skewness (Normal, Exp, Beta): ", 
           paste0(round(skew_e_t1, 4), collapse = ", "), "\n"))
## e1071: Type = 1 Skewness (Normal, Exp, Beta): 0.0407, 2.0573, -0.6279
cat(paste0("e1071: Type = 2 Skewness (Normal, Exp, Beta): ", 
           paste0(round(skew_e_t2, 4), collapse = ", "), "\n"))
## e1071: Type = 2 Skewness (Normal, Exp, Beta): 0.0407, 2.0576, -0.628
cat(paste0("e1071: Type = 3 Skewness (Normal, Exp, Beta): ", 
           paste0(round(skew_e_t3, 4), collapse = ", "), "\n"))
## e1071: Type = 3 Skewness (Normal, Exp, Beta): 0.0407, 2.057, -0.6278
#
# #Formula: (sigma_ (x_i - mu)^3) /(n * sd^3)
bb <- xxNormal
skew_man <- sum({bb - mean(bb)}^3) / {length(bb) * sd(bb)^3}
cat(paste0("(Manual) Skewness of Normal: ", round(skew_man, 4), 
           " (vs. e1071 Type 3 = ", round(skew_e_t3[1], 4), ") \n"))
## (Manual) Skewness of Normal: 0.0407 (vs. e1071 Type 3 = 0.0407)

Distributions

set.seed(3)
nn <- 10000L
# #Normal distribution is symmetrical
xxNormal <- rnorm(n = nn, mean = 0, sd = 1)
# #The exponential distribution is positive skew
xxExp <- rexp(n = nn, rate = 1)
# #The beta distribution with hyper-parameters α=5 and β=2 is negative skew
xxBeta <- rbeta(n = nn, shape1 = 5, shape2 = 2)
#
# #Save
f_setRDS(xxNormal)
f_setRDS(xxExp)
f_setRDS(xxBeta)
#f_getRDS(xxNormal)
# #Get the Distributions
xxNormal <- f_getRDS(xxNormal)
xxExp <- f_getRDS(xxExp)
xxBeta <- f_getRDS(xxBeta)

Density

# #Density Curve
# #Assumes 'hh' has data in 'ee'. In: cap_hh
#Basics
mean_hh <- mean(hh$ee)
sd_hh <- sd(hh$ee)
#
skew_hh <- skewness(hh$ee)
kurt_hh <- kurtosis(hh$ee)
# #Get Quantiles and Ranges of mean +/- sigma 
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 3 * sd_hh})
sig3l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 3 * sd_hh})
sig2r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 2 * sd_hh}, {x < mean_hh + 3 * sd_hh})
sig2l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 2 * sd_hh}, {x > mean_hh - 3 * sd_hh})
sig1r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + sd_hh}, {x < mean_hh + 2 * sd_hh})
sig1l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - sd_hh}, {x > mean_hh - 2 * sd_hh})
sig0r_hh <- density_hh_tbl %>% filter(x > mean_hh, {x < mean_hh + 1 * sd_hh})
sig0l_hh <- density_hh_tbl %>% filter(x < mean_hh, {x > mean_hh - 1 * sd_hh})
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
#
# # Latex Labels 
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'), 
                TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'), 
                TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
C03 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + 
  geom_density(alpha = 0.2, colour = "#21908CFF") + 
  geom_area(data = sig3l_hh, aes(x = x, y = y), fill = '#440154FF') + 
  geom_area(data = sig3r_hh, aes(x = x, y = y), fill = '#440154FF') + 
  geom_area(data = sig2l_hh, aes(x = x, y = y), fill = '#3B528BFF') + 
  geom_area(data = sig2r_hh, aes(x = x, y = y), fill = '#3B528BFF') + 
  geom_area(data = sig1l_hh, aes(x = x, y = y), fill = '#21908CFF') + 
  geom_area(data = sig1r_hh, aes(x = x, y = y), fill = '#21908CFF') + 
  geom_area(data = sig0l_hh, aes(x = x, y = y), fill = '#5DC863FF') + 
  geom_area(data = sig0r_hh, aes(x = x, y = y), fill = '#5DC863FF') + 
  scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh) + 
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), 
        axis.ticks = element_blank(), 
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(), 
        axis.line.y = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank()) + 
  labs(x = "x", y = "Density", 
       subtitle = paste0("Mean = ", round(mean_hh, 3), "; SD = ", round(sd_hh, 3), "; Skewness = ", round(skew_hh, 3), "; Kurtosis = ", round(kurt_hh, 3)), 
        caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C03)
rm(C03)

25.7.2 Kurtosis

Definition 25.18 Kurtosis \((\tilde{\mu}_{4})\) is a measure of the “tailedness” of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes the shape of a probability distribution. For \({\mathcal{N}}_{(\mu, \, \sigma)}\), kurtosis is 3 and excess kurtosis is 0 (i.e. subtract 3).

Distributions with zero excess kurtosis are called mesokurtic. The most prominent example of a mesokurtic distribution is the normal distribution. The kurtosis of any univariate normal distribution is 3.

Distributions with kurtosis less than 3 are said to be platykurtic. It means the distribution produces fewer and less extreme outliers than does the normal distribution. An example of a platykurtic distribution is the uniform distribution, which does not produce outliers.

Distributions with kurtosis greater than 3 are said to be leptokurtic. An example of a leptokurtic distribution is the Laplace distribution, which has tails that asymptotically approach zero more slowly than a Gaussian, and therefore produces more outliers than the normal distribution.

Kurtosis is the average (or expected value) of the standardized data raised to the fourth power. Any standardized values that are less than 1 (i.e., data within one standard deviation of the mean, where the “peak” would be), contribute virtually nothing to kurtosis, since raising a number that is less than 1 to the fourth power makes it closer to zero. The only data values that contribute to kurtosis in any meaningful way are those outside the region of the peak; i.e., the outliers. Therefore, kurtosis measures outliers only; it measures nothing about the “peak.”

The sample kurtosis is a useful measure of whether there is a problem with outliers in a data set. Larger kurtosis indicates a more serious outlier problem.

# #Kurtosis Calculation: Package "e1071" (Package "moments" deprecated)
dis_lst <- list(xxNormal, xxExp, xxBeta)
#
# #Kurtosis: Normal has value close to 3 Kurtosis (=0 excess Kurtosis)
# #Kurtosis "e1071" has Type = 3 as default. Its Type = 1 matches "moments" with difference of 3
kurt_e_t3 <- vapply(dis_lst, e1071::kurtosis, numeric(1))
kurt_e_t2 <- vapply(dis_lst, e1071::kurtosis, type = 2, numeric(1))
kurt_e_t1 <- vapply(dis_lst, e1071::kurtosis, type = 1, numeric(1))
kurt_mmt <-  vapply(dis_lst, moments::kurtosis, numeric(1))
stopifnot(identical(round(kurt_e_t1, 10), round(kurt_mmt - 3, 10)))
cat(paste0("e1071: Type = 1 Kurtosis (Normal, Exp, Beta): ", 
           paste0(round(kurt_e_t1, 4), collapse = ", "), "\n"))
## e1071: Type = 1 Kurtosis (Normal, Exp, Beta): -0.0687, 6.3223, -0.106
cat(paste0("e1071: Type = 2 Kurtosis (Normal, Exp, Beta): ", 
           paste0(round(kurt_e_t2, 4), collapse = ", "), "\n"))
## e1071: Type = 2 Kurtosis (Normal, Exp, Beta): -0.0682, 6.326, -0.1055
cat(paste0("e1071: Type = 3 Kurtosis (Normal, Exp, Beta): ", 
           paste0(round(kurt_e_t3, 4), collapse = ", "), "\n"))
## e1071: Type = 3 Kurtosis (Normal, Exp, Beta): -0.0693, 6.3204, -0.1066
#
# #Formula: (sigma_ (x_i - mu)^4) /(n * sd^4)
bb <- xxNormal
kurt_man <- {sum({bb - mean(bb)}^4) / {length(bb) * sd(bb)^4}} - 3
cat(paste0("(Manual) Kurtosis of Normal: ", round(kurt_man, 4), 
           " (vs. e1071 Type 3 = ", round(kurt_e_t3[1], 4), ") \n"))
## (Manual) Kurtosis of Normal: -0.0693 (vs. e1071 Type 3 = -0.0693)

25.8 Relative Location

25.8.1 z-Scores

Measures of relative location help us determine how far a particular value is from the mean. By using both the mean and standard deviation, we can determine the relative location of any observation.

Definition 25.19 A sample of \({n}\) observations given by \({X = \{{x}_1, {x}_2, \ldots, {x}_n\}}\) have a sample mean \({\overline{x}}\) and the sample standard deviation, \({s}\).
Definition 25.20 The z-score, \({z_i}\), can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). It is associated with each \({x_i}\). The z-score is often called the standardized value or standard score.
  • Refer equation (25.14) (Similar to equation (28.4))
    • For example, \(z_1 = 1.2\) would indicate that \({x_1}\) is 1.2 standard deviations greater than the sample mean. Similarly, \(z_2 = -0.5\) would indicate that \({x_2}\) is 0.5 standard deviation less than the sample mean.
    • A z-score greater than zero occurs for observations with a value greater than the mean, and a z-score less than zero occurs for observations with a value less than the mean.
    • A z-score of zero indicates that the value of the observation is equal to the mean.
    • The z-score for any observation can be interpreted as a measure of the relative location of the observation in a data set.
    • The process of converting a value for a variable to a z-score is often referred to as a z transformation or scaling.

\[z_i = \frac{{x}_i - {\overline{x}}}{{s}} \tag{25.14}\]
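Equation (25.14) can be applied directly in base R; the five-value sample below is illustrative, not from the text:

```r
# #z-Scores: number of SDs each observation is from the sample mean
x <- c(46, 54, 42, 46, 32)
z <- (x - mean(x)) / sd(x)
round(z, 2)
## [1]  0.25  1.25 -0.25  0.25 -1.50
#
# #z-scores always have (near-)zero mean and unit SD
round(c(mean(z), sd(z)))
## [1] 0 1
```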

NOTE: The z-score (standard score) is the general case: it can be computed whenever a mean and standard deviation are available, and it translates X into a variable with zero mean and unit variance (it does not make non-normal data normal!). The “Z statistic” used in inference is a special case that standardizes the sample mean: its denominator \(\sigma/\sqrt{n}\) is the ‘standard error of the sample mean’, which, per the CLT, is itself a standard deviation, namely that of the sampling distribution of the mean.

Definition 25.21 Computing a z-score requires knowing the mean \({\mu}\) and standard deviation \({\sigma}\) of the complete population to which a data point belongs. If one only has a sample of observations from the population, then the analogous computation with sample mean \({\overline{x}}\) and sample standard deviation \({s}\) yields the t-statistic.

Caution:

  • Scaling does influence the interpretation of the parameters in many statistical analyses (regression, PCA, etc.), so the decision to scale should be based on how you want to interpret your parameters.
    • Scaling is a linear transformation: it preserves the shape of a distribution (e.g. its skewness and kurtosis) but changes its location and spread, so a scaled variable can leave its named family.
    • Ex: After scaling, a Poisson-distributed variable would no longer be Poisson.
    • However, scaling will neither change the shape of the variable's distribution nor influence (positively or negatively) any violations of model assumptions.
xxflights <- f_getRDS(xxflights)
bb <- na.omit(xxflights$air_time)
# Scaling
ii <- {bb - mean(bb)} / sd(bb)
str(ii)
##  num [1:327346] 0.8145 0.8145 0.0994 0.3449 -0.3702 ...
##  - attr(*, "na.action")= 'omit' int [1:9430] 472 478 616 644 726 734 755 839 840 841 ...
# #scale() gives a Matrix with original mean and sd as its attribute
jj <- scale(bb)
str(jj)
##  num [1:327346, 1] 0.8145 0.8145 0.0994 0.3449 -0.3702 ...
##  - attr(*, "scaled:center")= num 151
##  - attr(*, "scaled:scale")= num 93.7
stopifnot(identical(as.vector(ii), as.vector(jj)))
#
hh <- tibble(ee = as.vector(jj))
ttl_hh <- "Flights: Air Time (Scaled)"
cap_hh <- "C03P08"

Image


Figure 25.2 (C03P07 C03P08) Before and After Scaling

Histogram

# #hh$ee ttl_hh cap_hh
#
C03 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
  geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') + 
  geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
  annotate(geom = "text", x = mean(.[[1]]), y = -Inf, 
           label = TeX(r'($\bar{x}$)', output = "character"), 
           color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE) +
  coord_cartesian(ylim = c(0, 35000)) +
  theme(plot.title.position = "panel") + 
  labs(x = "x", y = "Frequency", 
       subtitle = paste0("(Mean= ", round(mean(.[[1]]), 3), 
                         "; SD= ", round(sd(.[[1]]), 3),
                         ")"), 
      caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C03)
rm(C03)

Annotate

if(FALSE){
# #check_overlap = TRUE works for de-blurring. However, it still checks each point thus slow
geom_text(aes(label = TeX(r'($\bar{x}$)', output = "character"), 
              x = mean(.data[["ee"]]), y = -Inf),
          color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE, check_overlap = TRUE) 
# #Create your own dataset
geom_text(data = tibble(x = mean(.[[1]]), y = -Inf, 
                        label = TeX(r'($\bar{x}$)', output = "character")), 
          aes(x = x, y = y, label = label), 
          color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE ) 
# #Or Equivalent
ggplot2::annotate(geom = "text", x = mean(.[[1]]), y = -Inf, 
                  label = TeX(r'($\bar{x}$)', output = "character"), 
                  color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE) 
#
ggpp::annotate(geom = "text", x = mean(.[[1]]), y = -Inf, 
               label = TeX(r'($\bar{x}$)', output = "character"), 
               color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE) 
}

Colours

# #List All Colour Names in R
str(colors())
##  chr [1:657] "white" "aliceblue" "antiquewhite" "antiquewhite1" "antiquewhite2" "antiquewhite3" ...
# #Packages: viridis, scales, viridisLite
# #Show N Colours with Max. Contrast
q_colors <- 5
# #Display Colours
if(FALSE) show_col(viridis_pal()(q_colors))
# #Get the Viridis i.e. "D" palette Hex Values for N Colours
v_colors <-  viridis(q_colors, option = "D")
v_colors
## [1] "#440154FF" "#3B528BFF" "#21908CFF" "#5DC863FF" "#FDE725FF"
#
# #Diverging Colour Palette from 'RColorBrewer'
# #Hex Values
brewer.pal(3, "BrBG")
## [1] "#D8B365" "#F5F5F5" "#5AB4AC"
if(FALSE) display.brewer.pal(3, "BrBG")

25.8.2 Chebyshev Theorem

Definition 25.22 Chebyshev Theorem can be used to make statements about the proportion of data values that must be within a specified number of standard deviations \({\sigma}\), of the mean \({\mu}\).
  • Refer to 25.22
    • Chebyshev Theorem: At least \((1-1/z^2)\) of the data values must be within z standard deviations of the mean, where z is any value greater than 1.
      • Thus, at least 75% of the data values must be within \(\overline{x} \pm 2s\), 89% within \(\overline{x} \pm 3s\), and 94% \(\overline{x} \pm 4s\).
    • Chebyshev theorem can be applied to any data set regardless of the shape of the distribution of the data.
    • Ex: Test scores of 100 students have \((\mu = 70, \sigma = 5)\)
    • How many students had test scores between 60 and 80
      • From equation (25.14), \(z_{60} = \frac{60 - 70}{5} = -2\)
      • Similarly, \(z_{80} = \frac{80 - 70}{5} = +2\)
      • According to theorem 25.22, values that must be within \({z}\) standard deviation are
        • \({(1-1/z^2) = (1 - 1/2^2) = 0.75 = 75\%}\)
        • i.e. at least 75 students must have test scores between 60 and 80
    • How many students had test scores between 58 and 82
      • \(z_{58} = -2.4, z_{82} = +2.4\)
      • \({(1 - 1/2.4^2) \approx 0.826 \approx 83\%}\)
        • i.e. at least 83 students must have test scores between 58 and 82
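The Chebyshev bound can be verified numerically on any data set, regardless of shape; a minimal sketch with simulated, deliberately skewed data (the values are illustrative):

```r
# #Chebyshev: at least (1 - 1/z^2) of values lie within z SDs of the mean
set.seed(3)
bb <- rexp(n = 1000, rate = 1)  # heavily right-skewed, far from Normal
for (z in c(2, 3, 4)) {
  observed <- mean(abs(bb - mean(bb)) <= z * sd(bb))
  cat(paste0("z = ", z, ": Observed = ", round(observed, 3), 
             " >= Bound = ", round(1 - 1 / z^2, 3), "\n"))
}
```

The observed proportions are well above the bound here; Chebyshev is conservative because it assumes nothing about the distribution.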

25.8.3 Empirical Rule

Definition 25.23 Empirical rule is used to compute the percentage of data values that must be within one, two, and three standard deviations \({\sigma}\) of the mean \({\mu}\) for a normal distribution. These probabilities are Pr(x) 68.27%, 95.45%, and 99.73%.
  • According to the empirical rule, for a Normal distribution
    • \(Pr({\mu} - 1{\sigma} \leq {X} \leq {\mu} + 1{\sigma}) \approx 68.27\%\)
    • \(Pr({\mu} - 2{\sigma} \leq {X} \leq {\mu} + 2{\sigma}) \approx 95.45\%\) i.e. mostly
    • \(Pr({\mu} - 3{\sigma} \leq {X} \leq {\mu} + 3{\sigma}) \approx 99.73\%\) i.e. almost all data values
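These percentages follow directly from the Normal CDF; a quick check in base R:

```r
# #Empirical Rule: Pr(mu - k*sigma <= X <= mu + k*sigma) for a Normal variable
p_sigma <- function(k) pnorm(k) - pnorm(-k)
round(100 * vapply(1:3, p_sigma, numeric(1)), 2)
## [1] 68.27 95.45 99.73
```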

25.9 Outliers

Definition 25.24 Outliers are data points or observations that do not fit the trend shown by the remaining data. These differ significantly from other observations. Unusually large or small values are commonly found to be outliers.
  • Reasons
    • Outliers can occur by chance in any distribution, but they often indicate either measurement error or that the population has a heavy-tailed distribution. A frequent cause of outliers is a mixture of two distributions, which may be two distinct sub-populations.
    • In most larger samplings of data, some data points will be further away from the sample mean than what is deemed reasonable. However, a small number of outliers is to be expected (and not due to any anomalous condition).
    • Estimators capable of coping with outliers are said to be robust: the median is a robust statistic of central tendency, while the mean is not. However, the mean is generally a more precise estimator.
  • Outliers represent observations that are suspect and warrant careful examination.
    • They may represent erroneous data; if so, the data should be corrected.
    • They may signal a violation of model assumptions; if so, another model should be considered.
    • Finally, they may simply be unusual values that occurred by chance. In this case, they should be retained.
  • Keeping vs. Removing Outliers
    • data value that has been incorrectly recorded /included - should be removed
    • unusual data value that has been recorded correctly and belongs in the data set - should be kept
  • Standardized values (z-scores) can be used to identify outliers.
    • Empirical Rule allows us to conclude that for normal distribution, almost all the data values will be within three standard deviations of the mean \((\overline{x} \pm 3s)\).
    • Hence, in using z-scores to identify outliers, we recommend treating any data value with a z-score less than −3 or greater than +3 as an outlier.
    • Such data values can then be reviewed for accuracy and to determine whether they belong in the data set.
    • In the case of normally distributed data, the three sigma rule can be used to identify outliers.
      • In a sample of 1000 observations, the presence of up to five observations deviating from the mean by more than three times the standard deviation is within the range of what can be expected. If the sample size is only 100, however, just three such outliers are already reason for concern.
  • Another approach to identifying outliers is based upon IQR
    • \(\text{Lower Limit} = Q_1 - 1.5 \space \text{IQR}\) and \(\text{Upper Limit} = Q_3 + 1.5 \space \text{IQR}\)
    • An observation is classified as an outlier if its value is less than the lower limit or greater than the upper limit.
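The IQR rule is easy to apply with quantile(); the small vector below is illustrative, not from the text:

```r
# #IQR-based outlier limits: Q1 - 1.5 IQR and Q3 + 1.5 IQR
bb <- c(5, 7, 8, 9, 10, 11, 12, 13, 45)  # 45 looks suspect
q1 <- unname(quantile(bb, 0.25))
q3 <- unname(quantile(bb, 0.75))
lower <- q1 - 1.5 * (q3 - q1)
upper <- q3 + 1.5 * (q3 - q1)
cat(paste0("Limits: [", lower, ", ", upper, "]\n"))
## Limits: [2, 18]
bb[bb < lower | bb > upper]
## [1] 45
```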

25.10 Summary

Five-Number Summary is used to quickly summarise a dataset. i.e. Min, Q1, Median, Q3, Max

  • A boxplot is a graphical display of data based on a five-number summary.
    • By using the interquartile range, IQR = Q3 − Q1, limits are located at 1.5(IQR) below Q1 and 1.5(IQR) above Q3
    • The whiskers are drawn from the ends of the box to the smallest and largest values inside the limits
    • Boxplots can also be used to provide a graphical summary of two or more groups and facilitate visual comparisons among the groups.
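Base R returns the five-number summary directly via fivenum(); here applied to the Commercials values from Table 25.1:

```r
# #Five-Number Summary: Min, Lower Hinge (~Q1), Median, Upper Hinge (~Q3), Max
fivenum(c(2, 5, 1, 3, 4, 1, 5, 3, 4, 2))
## [1] 1 2 3 4 5
```

Note that fivenum() uses Tukey hinges, which can differ slightly from the quartiles returned by quantile().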

BoxPlot

(C03P01) geom_boxplot()

Figure 25.3 (C03P01) geom_boxplot()

Code

# #nycflights13::weather
bb <- weather
# #NA are present in the data
summary(bb$temp)
#
# #BoxPlot
C03P01 <- bb %>% drop_na(temp) %>% mutate(month = factor(month, ordered = TRUE)) %>% {
    ggplot(data = ., mapping = aes(x = month, y = temp)) +
    #geom_violin() +
    geom_boxplot(aes(fill = month), outlier.colour = 'red', notch = TRUE) +
    stat_summary(fun = mean, geom = "point", size = 2, color = "steelblue") + 
    scale_y_continuous(breaks = seq(0, 110, 10), limits = c(0, 110)) +
    #geom_point() +
    #geom_jitter(position=position_jitter(0.2)) +
    k_gglayer_box +
    theme(legend.position = 'none') +
    labs(x = "Months", y = "Temperature", subtitle = "With Mean & Notch", 
         caption = "C03P01", title = "BoxPlot")
}

25.11 Relationship between Two Variables

25.11.1 Covariance

Definition 25.25 Covariance is a measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship.
  • Refer equation (25.15)
    • For a sample of size \({n}\) with the observations \((x_1, y_1), (x_2, y_2)\), and so on, the covariance is given by equation (25.15)
    • A positive value for \(s_{xy}\) indicates a positive linear association between x and y; that is, as the value of x increases, the value of y increases. Similarly a negative value shows a negative linear association.
      • In the example, \(s_{xy} = 11\)
    • If the points are evenly distributed in the scatterplot, the value of \(s_{xy}\) will be close to zero, indicating no linear association between x and y.
    • Caution: Problem with using covariance as a measure of the strength of the linear relationship is that the value of the covariance depends on the units of measurement for x and y.

\[\begin{align} \sigma_{xy} &= \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{n} \\ s_{xy} &= \frac{\sum (x_i - \overline{x})(y_i - \overline{y})}{n-1} \end{align} \tag{25.15}\]


Figure 25.4 (C02P05 C03P02) Scatter Plot Quadrants for Covariance

Covariance

# #Get 'Deviation about the mean' i.e. devX and devY and their Product devXY
ii <- bb %>% 
  mutate(devX = Commercials - mean(Commercials), devY = Sales - mean(Sales), devXY = devX * devY) 
#
# #Sample Covariance
sxy <- sum(ii$devXY) / {length(ii$devXY) -1}
print(sxy)
## [1] 11

Code

bb <- f_getRDS(xxCommercials) 

# #Formula for Trendline calculation
k_gg_formula <- y ~ x
#
# #Scatterplot, Trendline Equation, R2, mean x & y
C03P02 <- bb %>% {
  ggplot(data = ., aes(x = Commercials, y = Sales)) + 
  geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
  stat_poly_eq(aes(label = paste0("atop(", ..eq.label.., ", \n", ..rr.label.., ")")), 
               formula = k_gg_formula, eq.with.lhs = "italic(hat(y))~`=`~",
               eq.x.rhs = "~italic(x)", parse = TRUE) +
  geom_vline(aes(xintercept = round(mean(Commercials), 3)), color = 'red', linetype = "dashed") +
  geom_hline(aes(yintercept = round(mean(Sales), 3)), color = 'red', linetype = "dashed") +
  geom_text(aes(label = TeX(r"($\bar{x} = 3$)", output = "character"), 
                x = round(mean(Commercials), 3), y = -Inf), 
            color = 'red', hjust = -0.2, vjust = -0.5, parse = TRUE, check_overlap = TRUE) + 
  geom_text(aes(label = TeX(r"($\bar{y} = 51$)", output = "character"), 
                x = Inf, y = round(mean(Sales), 3)), 
            color = 'red', hjust = 1.5, vjust = -0.5, parse = TRUE, check_overlap = TRUE) + 
  geom_point() +
  k_gglayer_scatter +
  labs(x = "Commercials", y = "Sales ($100s)",
       subtitle = TeX(r"(Trendline Equation, $R^{2}$, $\bar{x}$ and $\bar{y}$)"), 
       caption = "C03P02", title = "Scatter Plot")
}

More Text

  • Unlike Pearson correlation, covariance itself is not a measure of the magnitude of a linear relationship. It is a measure of co-variation (which could be merely monotonic), because covariance depends not only on the strength of the linear association but also on the magnitudes of the variances.

25.11.2 Correlation Coefficient

Definition 25.26 Correlation coefficient is a measure of linear association between two variables that takes on values between −1 and +1. Values near +1 indicate a strong positive linear relationship; values near −1 indicate a strong negative linear relationship; and values near zero indicate the lack of a linear relationship.
  • Refer equation (25.16) & Table 25.1
    • The ‘Pearson Product Moment Correlation Coefficient’ or sample correlation coefficient is computed by dividing the sample covariance \(s_{xy}\) by the product of the sample standard deviation of x (\(s_{x}\)) and the sample standard deviation of y (\(s_{y}\)).
      • Values close to −1 (negative) or +1 (positive) indicate a strong linear relationship. The closer the correlation is to zero, the weaker the relationship.
    • In the example, \(s_{xy} = 11\) (Equation (25.15)) and \(s_{x} = 1.49\), \(s_{y} = 7.93\) (Equation (25.12))
    • Thus, \(r_{xy} = 0.93\)
    • Caution: Correlation provides a measure of linear association and not necessarily causation. A high correlation between two variables does not mean that changes in one variable will cause changes in the other variable.
    • Caution: Because the correlation coefficient measures only the strength of the linear relationship between two quantitative variables, it is possible for the correlation coefficient to be near zero, suggesting no linear relationship, when the relationship between the two variables is nonlinear.

\[\begin{align} \rho_{xy} &= \frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}} \\ r_{xy} &= \frac{s_{xy}}{s_{x}s_{y}} \end{align} \tag{25.16}\]

Correlation

# #Get 'Deviation about the mean' i.e. devX and devY and their Product devXY
ii <- bb %>% 
  mutate(devX = Commercials - mean(Commercials), devY = Sales - mean(Sales), devXY = devX * devY) 
#
# #Sample Covariance
sxy <- sum(ii$devXY) / {length(ii$devXY) -1}
print(sxy)
## [1] 11
jj <- ii %>% mutate(devXsq = devX * devX, devYsq = devY * devY)
# #Sample Covariance Sx, Sample Standard Deviations Sx Sy
sxy <- sum(ii$devXY) / {nrow(ii) -1}
sx <- round(sqrt(sum(jj$devXsq) / {nrow(jj) -1}), 2)
sy <- round(sqrt(sum(jj$devYsq) / {nrow(jj) -1}), 2)
cat(paste0("Sxy =", sxy, ", Sx =", sx, ", Sy =", sy, "\n"))
## Sxy =11, Sx =1.49, Sy =7.93
#
# #Correlation Coefficient Rxy
rxy <- round(sxy / {sx * sy}, 2)
cat(paste0("Correlation Coefficient Rxy =", rxy, "\n"))
## Correlation Coefficient Rxy =0.93
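As a cross-check, the built-in cov() and cor() on the Table 25.1 data reproduce the manual values:

```r
# #Cross-check with base R on the data of Table 25.1
commercials <- c(2, 5, 1, 3, 4, 1, 5, 3, 4, 2)
sales <- c(50, 57, 41, 54, 54, 38, 63, 48, 59, 46)
round(cov(commercials, sales), 2)
## [1] 11
round(cor(commercials, sales), 2)
## [1] 0.93
```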

Data

Table 25.1: (C03T01) Correlation Calculation
Week Commercials Sales devX devY devXY devXsq devYsq
1 2 50 -1 -1 1 1 1
2 5 57 2 6 12 4 36
3 1 41 -2 -10 20 4 100
4 3 54 0 3 0 0 9
5 4 54 1 3 3 1 9
6 1 38 -2 -13 26 4 169
7 5 63 2 12 24 4 144
8 3 48 0 -3 0 0 9
9 4 59 1 8 8 1 64
10 2 46 -1 -5 5 1 25

Validation


26 Probability

26.1 Overview

  • This chapter covers Probability, Factorial, Combinations, Permutations, Bayes Theorem.

26.2 Probability

Definition 26.1 Probability is a numerical measure of the likelihood that an event will occur. Probability values are always assigned on a scale from 0 to 1. A probability near zero indicates an event is unlikely to occur; a probability near 1 indicates an event is almost certain to occur.
Definition 26.2 A random experiment is a process that generates well-defined experimental outcomes. On any single repetition or trial, the outcome that occurs is determined completely by chance.
Definition 26.3 The sample space for a random experiment is the set of all experimental outcomes.
  • Random experiment of tossing a coin has a Sample Space \(S = \{\text{Head}, \text{Tail}\}\)
  • Random experiment of rolling a die has a Sample Space \(S = \{1, 2, 3, 4, 5, 6\}\)
  • Random experiment of tossing Two coins has a Sample Space \(S = \{\text{HH}, \text{HT}, \text{TH}, \text{TT}\}\)

26.3 Counting Rule

Definition 26.4 Counting Rule for Multiple-Step Experiments: If an experiment can be described as a sequence of \({k}\) steps with \({n_1}\) possible outcomes on the first step, \({n_2}\) possible outcomes on the second step, and so on, then the total number of experimental outcomes is given by \(\{(n_1)(n_2) \cdots (n_k) \}\)
Definition 26.5 A tree diagram is a graphical representation that helps in visualizing a multiple-step experiment.
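The counting rule can be illustrated with expand.grid(), which enumerates every step-wise combination of outcomes; the coin-and-die experiment below is a made-up example:

```r
# #Counting Rule: a coin toss (n1 = 2) followed by a die roll (n2 = 6)
outcomes <- expand.grid(coin = c("H", "T"), die = 1:6)
nrow(outcomes)  # n1 * n2
## [1] 12
```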

26.4 Factorial

Definition 26.6 The factorial of a non-negative integer \({n}\), denoted by \(n!\), is the product of all positive integers less than or equal to n. The value of 0! is 1 i.e. \(0!=1\)

\[\begin{align} n! &= \prod _{i=1}^n i = n \cdot (n-1) \\ &= n \cdot(n-1)\cdot(n-2)\cdot(n-3)\cdot\cdots \cdot 3 \cdot 2 \cdot 1 \end{align} \tag{26.1}\]
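Base R provides factorial() directly:

```r
# #Factorial; note 0! = 1
factorial(5)
## [1] 120
factorial(0)
## [1] 1
```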

26.5 Combinations

Definition 26.7 Combination allows one to count the number of experimental outcomes when the experiment involves selecting \({k}\) objects from a set of \({N}\) objects. The number of combinations of \({N}\) objects taken \({k}\) at a time is equal to the binomial coefficient \(C_k^N\)

\[C_k^N = \binom{N}{k} = \frac{N!}{k!(N-k)!} \tag{26.2}\]
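In R, the binomial coefficient of equation (26.2) is choose():

```r
# #Combinations: choose(N, k) = N! / (k! (N - k)!)
choose(6, 3)
## [1] 20
```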

26.6 Permutations

Definition 26.8 Permutation allows one to compute the number of experimental outcomes when \({k}\) objects are to be selected from a set of \({N}\) objects where the order of selection is important. The same \({k}\) objects selected in a different order are considered a different experimental outcome. The number of permutations of \({N}\) objects taken \({k}\) at a time is given by \(P_k^N\)

\[P_k^N = k! \binom{N}{k} = \frac{N!}{(N-k)!} \tag{26.3}\]

  • The number of permutations of \({k}\) distinct objects is \(k!\)
    • An experiment results in more permutations than combinations for the same number of objects because every selection of \({k}\) objects can be ordered in \(k!\) different ways.
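R has no built-in permutation count, but equation (26.3) is a one-liner; perm() below is a hypothetical helper name, not a base R function:

```r
# #Permutations: P = k! * choose(N, k) = N! / (N - k)!
perm <- function(N, k) factorial(k) * choose(N, k)
perm(6, 3)
## [1] 120
```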

26.7 Assigning Probabilities

  • Basic Requirements (Similar to the Discrete Probability & Continuous Probability)
    1. The probability assigned to each experimental outcome must be between 0 and 1, inclusively. If we let \({E_i}\) denote the \(i^{th}\) experimental outcome and \(P(E_i)\) its probability, then \(P(E_i) \in [0, 1]\)
    2. The sum of the probabilities for all the experimental outcomes must equal 1. Thus for \({k}\) experimental outcomes \(\sum _{i=1}^k P(E_i) =1\)
Definition 26.9 An event is a collection of sample points. The probability of any event is equal to the sum of the probabilities of the sample points in the event. The sample space, \({S}\), is an event. Because it contains all the experimental outcomes, it has a probability of 1; that is, \(P(S) = 1\)
Definition 26.10 Given an event \({A}\), the complement of A (\(A^c\)) is defined to be the event consisting of all sample points that are not in A. Thus, \(P(A) + P(A^{c}) =1\)
Definition 26.11 Given two events A and B, the union of A and B is the event containing all sample points belonging to A or B or both. The union is denoted by \(A \cup B\)
Definition 26.12 Given two events A and B, the intersection of A and B is the event containing the sample points belonging to both A and B. The intersection is denoted by \(A \cap B\)
  • Refer to the Addition Law in the equation (26.4)

\[P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{26.4}\]
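A quick numeric check of the Addition Law on a standard 52-card deck (Ace or Club):

```r
# #Addition Law: P(Ace or Club) = P(Ace) + P(Club) - P(Ace of Clubs)
p_ace <- 4/52; p_club <- 13/52; p_both <- 1/52
p_ace + p_club - p_both  # = 16/52
## [1] 0.3076923
```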

Definition 26.13 Two events are said to be mutually exclusive if the events have no sample points in common. Thus, \(A \cap B = \emptyset\) and \(P(A \cap B) = 0\)

26.8 Exercises

  • How many ways can three items be selected from a group of six items
    • Solution: \(C_{3}^{6} = 6!/(3!\,3!) = 20\)
  • In an experiment of tossing a coin three times, how many experimental outcomes are possible
    • Solution: \(2^{3} = 8\)
  • Simple random sampling uses a sample of size k from a population of size N to obtain data that can be used to make inferences about the characteristics of a population. Suppose that, from a population of 50 bank accounts, we want to take a random sample of four accounts in order to learn about the population. How many different random samples of four accounts are possible
    • Solution: \(C_{4}^{50} = 50!/4!46!\)
  • To play Powerball, a participant must select five numbers from the digits 1 through 59, and then select a Powerball number from the digits 1 through 35. To determine the winning numbers for each game, lottery officials draw 5 white balls out a drum of 59 white balls numbered 1 through 59 and 1 red ball out of a drum of 35 red balls numbered 1 through 35. To win the Powerball jackpot, numbers on the lottery must match the numbers on the 5 white balls in any order and must also match the number on the red Powerball. How many Powerball lottery outcomes are possible
    • Solution: \(C_{5}^{59} \times C_{1}^{35}\)
  • An experiment has four equally likely outcomes: E1, E2, E3, and E4
    • What is the probability that E2 occurs
      • Solution: \({1/4}\)
    • What is the probability that any two of the outcomes occur (e.g., E1 or E3)
      • Solution: \(2/4 = 1/2\)
    • What is the probability that any three of the outcomes occur (e.g., E1 or E2 or E4)
      • Solution: \({3/4}\)
  • Consider the experiment of selecting a playing card from a deck of 52 playing cards. Each card corresponds to a sample point with a 1/52 probability.
    • Probability of the event that an ace is selected
      • Solution: \(4/52 = 1/13\)
    • Probability of the event that a club is selected
      • Solution: \(13/52 = 1/4\)
    • Probability of the event that a face card (jack, queen, or king) is selected
      • Solution: \(3 \times 4/52 = 12/52 = 3/13\)
  • Consider the experiment of rolling a pair of dice. Suppose that we are interested in the sum of the face values showing on the dice.
    • How many sample points are possible
      • Solution: \(6 \times 6 = 36\)
    • What is the probability of obtaining a value of 7
      • Solution: \(E_{7} = \{(1,6), (6,1), (2,5), (5,2), (3,4), (4,3)\} \Rightarrow P(E_{7}) = 6/36 = 1/6\)
    • What is the probability of obtaining a value of 9 or greater
      • Solution: \(P(E_{\geq 9}) = P(E_{9} \cup E_{10} \cup E_{11} \cup E_{12}) = \frac{4 + 3 + 2 + 1}{36} = \frac{10}{36} = \frac{5}{18}\)
    • Because each roll has six possible even values (2, 4, 6, 8, 10, and 12) and only five possible odd values (3, 5, 7, 9, and 11), the dice should show even values more often than odd values. Do you agree with this statement
      • Solution: No. Of the 36 sample points, 18 yield an odd sum and 18 yield an even sum, so \(P(\text{odd}) = P(\text{even}) = 18/36 = 1/2\)
  • A survey of magazine subscribers showed that 45.8% rented a car during the past 12 months for business reasons, 54% rented a car during the past 12 months for personal reasons, and 30% rented a car during the past 12 months for both business and personal reasons.
    • Let B denote Business, P denote Personal
    • What is the probability that a subscriber rented a car during the past 12 months for business or personal reasons
      • Solution: \(P(B \cup P) = P(B) + P(P) - P(B \cap P) = 0.458 + 0.540 - 0.3 = 0.698\)
    • What is the probability that a subscriber did not rent a car during the past 12 months for either business or personal reasons
      • Solution: \(P((B \cup P)^{c}) = 1 - 0.698 = 0.302\)
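The counting answers above can be verified with base R's choose(); the dice enumeration at the end is a small illustrative addition:

```r
# #Combinations: choose(n, k) = n!/(k! (n - k)!)
choose(6, 3)                   # #three items from six
## [1] 20
choose(50, 4)                  # #four accounts from fifty
## [1] 230300
choose(59, 5) * choose(35, 1)  # #Powerball outcomes
## [1] 175223510
# #Dice: enumerate all 36 outcomes and count sums equal to 7
sums <- outer(1:6, 1:6, `+`)
sum(sums == 7) / length(sums)
## [1] 0.1666667
```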

26.9 Conditional Probability

Definition 26.14 Conditional probability is the probability of an event given that another event has already occurred. The conditional probability of ‘A given B’ is \(P(A|B) = \frac{P(A \cap B)}{P(B)}\)
Table 26.1: (C04T01) Police: Promotion and Gender
Promo_Gender Men Women SUM
Promoted 288 36 324
NotPromoted 672 204 876
Total 960 240 1200
Table 26.1: (C04T01A) Joint and Marginal Probabilities
Promo_Gender Men Women SUM
Promoted 0.24 0.03 0.27
NotPromoted 0.56 0.17 0.73
Total 0.80 0.20 1.00
  • Refer to the Police Promotion Table 26.1
    • Let, M (Man), W (Woman), A (Promoted), \(A^{c}\) (Not Promoted)
    • Probability that a randomly selected officer …
      • is man and is promoted: \(P(A \cap M) = 288/1200 = 0.24\)
      • is woman and is promoted: \(P(A \cap W) = 36/1200 = 0.03\)
      • is man and is not promoted: \(P(A^{c} \cap M) = 672/1200 = 0.56\)
      • is woman and is not promoted: \(P(A^{c} \cap W) = 204/1200 = 0.17\)
      • NOTE: Each of these are Joint Probabilities because these provide intersection of two events.
    • Marginal probabilities are the values in the margins of the joint probability table and indicate the probabilities of each event separately.
      • \(P(M) = 0.80, P(W) = 0.20, P(A) = 0.27, P(A^{c}) = 0.73\)
      • Ex: the marginal probability of being promoted is \(P(A) = P(A \cap M) + P(A \cap W)\)
    • Conditional Probability Analysis
      • “the probability that an officer is promoted given that the officer is a man” \(P(A|M)\)
        • \(P(A|M) = 288/960 = 0.30\)
        • OR \(P(A|M) = P(A \cap M) / P(M) = 0.24/0.80 = 0.30\)
        • “Given that an officer is a man, that officer had a 30% chance of receiving a promotion”
      • “the probability that an officer is promoted given that the officer is a woman” \(P(A|W)\)
        • \(P(A|W) = P(A \cap W) / P(W) = 0.03/0.20 = 0.15\)
        • “Given that an officer is a woman, that officer had a 15% chance of receiving a promotion”
      • Conclusion
        • The probability of a promotion given that the officer is a man is .30, twice the .15 probability of a promotion given that the officer is a woman.
        • Although the use of conditional probability does not in itself prove that discrimination exists in this case, the conditional probability values do support this argument.
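The joint, marginal, and conditional probabilities above can be reproduced from the counts in Table 26.1 with a few lines of base R:

```r
# #Police promotion counts (Table 26.1)
promo <- matrix(c(288, 36, 672, 204), nrow = 2, byrow = TRUE,
                dimnames = list(c("Promoted", "NotPromoted"), c("Men", "Women")))
# #Joint probability table
promo / sum(promo)
# #Conditional probabilities of promotion given gender
promo["Promoted", "Men"] / sum(promo[, "Men"])      # #P(A|M)
## [1] 0.3
promo["Promoted", "Women"] / sum(promo[, "Women"])  # #P(A|W)
## [1] 0.15
```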
Definition 26.15 Two events A and B are independent if \(P(A|B) = P(A) \quad \text{OR} \quad P(B|A) = P(B) \Rightarrow P(A \cap B) = P(A) \cdot P(B)\)
  • Refer to the Multiplication Law in the equation (26.5)
    • Example: 84% of the households in a neighborhood subscribe to the daily edition of a newspaper; that is, \(P(D) =0.84\). In addition, it is known that the probability that a household that already holds a daily subscription also subscribes to the Sunday edition is .75; that is, \(P(S|D) =0.75\)
      • What is the probability that a household subscribes to both the Sunday and daily editions of the newspaper
        • \(P(S \cap D) = P(D) \cdot P(S|D) = 0.84 \times 0.75 = 0.63\)
        • “63% of the households subscribe to both the Sunday and daily editions”

\[\begin{align} P(A \cap B) &= P(B) \cdot P(A | B) \\ &= P(A) \cdot P(B | A) \end{align} \tag{26.5}\]

  • Mutually Exclusive vs. Independent Events
    • Two events with nonzero probabilities cannot be both mutually exclusive and independent.
    • If one mutually exclusive event is known to occur, the other cannot occur; thus, the probability of the other event occurring is reduced to zero. They are therefore dependent.

26.10 Bayes Theorem

Often, we begin the analysis with initial or prior probability estimates for specific events of interest. Then, from sources such as a sample, a special report, or a product test, we obtain additional information about the events. Given this new information, we update the prior probability values by calculating revised probabilities, referred to as posterior probabilities. Bayes theorem provides a means for making these probability calculations.

  • Refer to the equation (26.6)
    • Bayes theorem is applicable when the events for which we want to compute posterior probabilities are mutually exclusive and their union is the entire sample space.
      • An event, \(P(A)\), and its complement, \(P(A^{c})\), are mutually exclusive, and their union is the entire sample space. Thus, Bayes theorem is always applicable for computing posterior probabilities of an event and its complement.
    • Example: A firm has two suppliers, currently 65% parts are supplied by one and remaining by other; that is, \(P(A_{1}) = 0.65, P(A_{2}) = 0.35\). Quality of products supplied is 98% Good for supplier one and 95% Good for supplier 2.
      • \(P(G|A_{1}) = 0.98, P(B|A_{1}) = 0.02\)
      • \(P(G|A_{2}) = 0.95, P(B|A_{2}) = 0.05\)
      • Given that we received a Bad Part, what is the probability that it came from supplier 2
        • \(P(A_{2}|B) = \frac{P(A_{2})P(B|A_{2})}{P(A_{1}) P(B|A_{1})+ P(A_{2}) P(B|A_{2})} = \frac{0.35 \times 0.05}{0.65 \times 0.02 + 0.35 \times 0.05} = 0.5738 \approx 57\%\)
        • Similarly, \(P(A_{1}|B) = 0.4262 \approx 43\%\)
      • NOTE: While the Probability of a random part being from supplier 1 is \(P(A_{1}) = 0.65\), it is reduced to \(P(A_{1}|B) = 0.4262 \approx 43\%\) as we have received new information that the part is Bad.

\[\begin{align} P(A_{1}|B) &= \frac{P(A_{1})P(B|A_{1})}{P(A_{1}) P(B|A_{1})+ P(A_{2}) P(B|A_{2})} \\ P(A_{2}|B) &= \frac{P(A_{2})P(B|A_{2})}{P(A_{1}) P(B|A_{1})+ P(A_{2}) P(B|A_{2})} \end{align} \tag{26.6}\]
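A minimal base-R sketch of the two-supplier example, applying equation (26.6):

```r
# #Prior probabilities and bad-part rates from the supplier example
prior <- c(A1 = 0.65, A2 = 0.35)
p_bad <- c(A1 = 0.02, A2 = 0.05)
# #Posterior P(Ai|B): each numerator divided by the total probability of a bad part
posterior <- prior * p_bad / sum(prior * p_bad)
round(posterior, 4)
##     A1     A2 
## 0.4262 0.5738
```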

Validation


27 Discrete Probability Distributions

27.1 Overview

27.2 Definitions (Ref)

23.13 Quantitative data that measure ‘how many’ are discrete.

23.14 Quantitative data that measure ‘how much’ are continuous because no separation occurs between the possible data values.

27.3 Random Variable

Definition 27.1 A random variable is a numerical description of the outcome of an experiment. Random variables must assume numerical values. It can be either ‘discrete’ or ‘continuous.’
Definition 27.2 A random variable that may assume either a finite number of values or an infinite sequence of values such as \(0, 1, 2, \dots\) is referred to as a discrete random variable. Numerically coded factor variables (e.g., Male as 0, Female as 1) are also discrete.
Definition 27.3 A random variable that may assume any numerical value in an interval or collection of intervals is called a continuous random variable. It is given by \(x \in [n, m]\). If the entire line segment between the two points also represents possible values for the random variable, then the random variable is continuous.

27.4 Discrete Probability Distributions

Definition 27.4 The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.
Definition 27.5 For a discrete random variable x, a probability function \(f(x)\), provides the probability for each value of the random variable.
  • The use of the relative frequency method to develop discrete probability distributions leads to what is called an empirical discrete distribution.
    • We treat the data as if they were the population and use the relative frequency method to assign probabilities to the experimental outcomes.
    • The distribution of data is how often each observation occurs, and can be described by its central tendency and variation around that central tendency.
  • Basic Requirements (Similar to the Probability Basics & Continuous Probability)
    1. \(f(x) \geq 0\)
    2. \(\sum {f(x)} = 1\)
  • The simplest example of a discrete probability distribution given by a formula is the discrete uniform probability distribution; \(f(x) = 1/n\), where \({n}\) is the number of values the random variable may assume
    • Each possible value of the random variable has the same probability

27.4.1 Expected Value

Definition 27.6 The expected value, or mean, of a random variable is a measure of the central location for the random variable. i.e. \(E(x) = \mu = \sum xf(x)\)
  • NOTE
    • The expected value is a weighted average of the values of the random variable where the weights are the probabilities.
    • The expected value does not have to be a value the random variable can assume, i.e., the average need not be an integer.

27.4.2 Variance

Definition 27.7 The variance is a weighted average of the squared deviations of a random variable from its mean. The weights are the probabilities. i.e. \(\text{Var}(x) = \sigma^2 = \sum \{(x- \mu)^2 \cdot f(x)\}\)
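Both definitions can be checked against the DiCarlo car-sales distribution tabulated in the next section:

```r
# #DiCarlo daily car sales: x = cars sold, fx = relative frequencies (Table 27.1)
x  <- 0:5
fx <- c(0.18, 0.39, 0.24, 0.14, 0.04, 0.01)
Ex   <- sum(x * fx)            # #expected value: probability-weighted average
Varx <- sum((x - Ex)^2 * fx)   # #variance: probability-weighted squared deviations
c(Ex = Ex, Var = Varx)
##   Ex  Var 
## 1.50 1.25
```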

27.5 Bivariate Distributions

Definition 27.8 A probability distribution involving two random variables is called a bivariate probability distribution. A discrete bivariate probability distribution provides a probability for each pair of values that may occur for the two random variables.
  • NOTE:
    • Each outcome for a bivariate experiment consists of two values, one for each random variable. Example: Rolling a pair of dice
    • Bivariate probabilities are often called joint probabilities

27.6 Ex Dicarlo

Table

Table 27.1: (C05T04) Variance Calculation
\({x}\) \(f(x)\) \(xf(x)\) \((x - \mu)\) \((x - \mu)^2\) \((x - \mu)^{2}f(x)\)
0 0.18 0 -1.5 2.25 0.405
1 0.39 0.39 -0.5 0.25 0.0975
2 0.24 0.48 0.5 0.25 0.06
3 0.14 0.42 1.5 2.25 0.315
4 0.04 0.16 2.5 6.25 0.25
5 0.01 0.05 3.5 12.25 0.1225
Total 1.00 mu = 1.5 NA NA sigma^2 = 1.25

Data

# #Dicarlo: Days with Number of Cars Sold per day for last 300 days
xxdicarlo <- tibble(Cars = 0:5, Days = c(54, 117, 72, 42, 12, 3))
#
bb <- xxdicarlo
bb <- bb %>% rename(x = Cars, Fx = Days) %>% mutate(across(Fx, ~./sum(Fx))) %>% 
  mutate(xFx = x * Fx, x_mu = x - sum(xFx), 
             x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx) 
R_dicarlo_var_y_C05 <- sum(bb$x_mu_sq_Fx)
# #Total Row
bb <- bb %>% 
  mutate(across(1, as.character)) %>% 
  add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum))) %>% 
  mutate(xFx = ifelse(x == "Total", paste0("mu = ", xFx), xFx),
         x_mu_sq_Fx = ifelse(x == "Total", paste0("sigma^2 = ", x_mu_sq_Fx), x_mu_sq_Fx)) %>% 
  mutate(across(4:5, ~ replace(., x == "Total", NA)))

Change Class

# #Change Column Classes as required
bb %>% mutate(across(1, as.character))
bb %>% mutate(across(everything(), as.character))

Modify Value

bb <- xxdicarlo
ii <- bb %>% rename(x = Cars, Fx = Days) %>% mutate(across(Fx, ~./sum(Fx))) %>% 
  mutate(xFx = x * Fx, x_mu = x - sum(xFx), 
             x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx) 
# #Add Total Row
ii <- ii %>% 
  mutate(across(1, as.character)) %>% 
  add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum))) 
#
# #Modify Specific Row Values without using filter() 
# #filter() does not have 'un-filter()' function like group()-ungroup() combination
# #Selecting Row where x = "Total" and changing Column Values for Two Columns
ii <- ii %>% 
  mutate(xFx = ifelse(x == "Total", paste0("mu = ", xFx), xFx),
       x_mu_sq_Fx = ifelse(x == "Total", paste0("sigma^2 = ", x_mu_sq_Fx), x_mu_sq_Fx)) 
#
# #Selecting Row where x = "Total" and doing same replacement on Two Columns
ii %>% mutate(across(4:5, function(y) replace(y, x == "Total", NA)))
ii %>% mutate(across(4:5, ~ replace(., x == "Total", NA)))

27.7 Ex Dicarlo GS

Table

Table 27.2: (C05T01) Bivariate Table
Geneva_Saratoga y0 y1 y2 y3 y4 y5 SUM
x0 21 30 24 9 2 0 86
x1 21 36 33 18 2 1 111
x2 9 42 9 12 3 2 77
x3 3 9 6 3 5 0 26
Total 54 117 72 42 12 3 300
Table 27.2: (C05T02) Probability Distribution
Geneva_Saratoga y0 y1 y2 y3 y4 y5 SUM
x0 0.07 0.10 0.08 0.03 0.007 0.000 0.29
x1 0.07 0.12 0.11 0.06 0.007 0.003 0.37
x2 0.03 0.14 0.03 0.04 0.010 0.007 0.26
x3 0.01 0.03 0.02 0.01 0.017 0.000 0.09
Total 0.18 0.39 0.24 0.14 0.040 0.010 1.00

DataGS

xxdicarlo_gs <- tibble(Geneva_Saratoga = c("x0", "x1", "x2", "x3"), 
             y0 = c(21, 21, 9, 3), y1 = c(30, 36, 42, 9), y2 = c(24, 33, 9, 6), 
             y3 = c(9, 18, 12, 3), y4 = c(2, 2, 3, 5), y5 = c(0, 1, 2, 0))
bb <- xxdicarlo_gs
#
# #Tibble Total SUM 
sum_bb <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
#
# #Add Total Row and SUM Column
ii <- bb %>% 
  mutate(across(1, as.character)) %>% 
  add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.numeric), sum))) %>% 
  mutate(SUM = rowSums(across(where(is.numeric))))
#
# #Convert to Bivariate Probability Distribution and then add Total Row and SUM Column
jj <- bb %>% 
  mutate(across(where(is.numeric), ~./sum_bb)) %>% 
  mutate(across(1, as.character)) %>% 
  add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.numeric), sum))) %>% 
  mutate(SUM = rowSums(across(where(is.numeric)))) %>% 
  mutate(across(where(is.numeric), format, digits =1))

Tibble Total SUM

bb <- xxdicarlo_gs
# #Assuming there is NO Total Column NOR Total Row and First Column is character
kk <- bb %>% summarise(across(where(is.numeric), sum)) %>% summarise(sum(.)) %>% pull(.)
ll <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
stopifnot(identical(kk, ll))
print(kk)
## [1] 300

format()

bb <- xxdicarlo_gs
# #Round off values to 1 significant digit i.e. 0.003 or 0.02
# #NOTE: This changes the column to "character"
bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>% 
  mutate(across(where(is.numeric), format, digits =1))
## # A tibble: 4 x 7
##   Geneva_Saratoga y0    y1    y2    y3    y4    y5   
##   <chr>           <chr> <chr> <chr> <chr> <chr> <chr>
## 1 x0              0.07  0.10  0.08  0.03  0.007 0.000
## 2 x1              0.07  0.12  0.11  0.06  0.007 0.003
## 3 x2              0.03  0.14  0.03  0.04  0.010 0.007
## 4 x3              0.01  0.03  0.02  0.01  0.017 0.000

27.8 Bivariate …

  • Suppose we would like to know the probability distribution for total sales at both DiCarlo dealerships and the expected value and variance of total sales.
    • We can define \(s = x + y\) as Total Sales.
    • Refer to the Tables 27.2 and 27.3
      • \(f(s_0) = f(x_0, y_0) = 0.07\)
      • \(f(s_1) = f(x_0, y_1) + f(x_1, y_0) = 0.10 + 0.07 = 0.17\)

Table

Table 27.3: (C05T03) Bivariate Expected Value and Variance
\(ID\) \({s}\) \(f(s)\) \(sf(s)\) \((s - E(s))\) \((s - E(s))^2\) \((s - E(s))^{2}f(s)\)
A 0 0.070 0.00 -2.64 6.99 0.489
B 1 0.170 0.17 -1.64 2.70 0.459
C 2 0.230 0.46 -0.64 0.41 0.095
D 3 0.290 0.87 0.36 0.13 0.037
E 4 0.127 0.51 1.36 1.84 0.233
F 5 0.067 0.33 2.36 5.55 0.370
G 6 0.023 0.14 3.36 11.27 0.263
H 7 0.023 0.16 4.36 18.98 0.443
I 8 0.000 0.00 5.36 28.69 0.000
Total NA 1.000 E(s) = 2.64 NA NA Var(s) = 2.389

Code

bb <- xxdicarlo_gs
sum_bb <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
# #Convert to Bivariate Probability Distribution
ii <- bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>% select(-1)
# #Using tapply(), sum the Matrix
jj <- tapply(X= as.matrix(ii), INDEX = LETTERS[row(ii) + col(ii)-1], FUN = sum)
# #Create Tibble
kk <- tibble(Fs = jj, ID = LETTERS[1:length(Fs)], s = 1:length(Fs) - 1) %>% 
  relocate(Fs, .after = last_col()) %>% 
  mutate(sFs = s * Fs, s_Es = s - sum(sFs), 
             s_Es_sq = s_Es * s_Es, s_Es_sq_Fs = s_Es_sq * Fs) 
# #Save for Notebook
R_dicarlo_var_s_C05 <- sum(kk$s_Es_sq_Fs)
# #For Printing
ll <- kk %>% 
  add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum))) %>% 
  mutate(across(where(is.numeric), format, digits =2)) %>% 
  mutate(sFs = ifelse(ID == "Total", paste0("E(s) = ", sFs), sFs),
         s_Es_sq_Fs = ifelse(ID == "Total", paste0("Var(s) = ", s_Es_sq_Fs), s_Es_sq_Fs)) %>% 
  mutate(across(c(2, 5, 6), ~ replace(., ID == "Total", NA)))

Bivariate to Original

bb <- xxdicarlo_gs
# #From the Bivariate get the original data
ii <- bb %>% 
  mutate(Fx = rowSums(across(where(is.numeric)))) %>% 
  select(1, 8) %>% 
  separate(col = Geneva_Saratoga, into = c(NA, "x"), sep = 1) %>% 
  mutate(across(1, as.integer))
# #Variance Calculation
jj <- ii %>% mutate(across(Fx, ~./sum(Fx))) %>% 
  mutate(xFx = x * Fx, x_mu = x - sum(xFx), 
             x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx) 
# #Save for Notebook
R_dicarlo_var_x_C05 <- sum(jj$x_mu_sq_Fx)
print(jj)
## # A tibble: 4 x 6
##       x     Fx   xFx   x_mu x_mu_sq x_mu_sq_Fx
##   <int>  <dbl> <dbl>  <dbl>   <dbl>      <dbl>
## 1     0 0.287  0     -1.14   1.31      0.375  
## 2     1 0.37   0.37  -0.143  0.0205    0.00760
## 3     2 0.257  0.513  0.857  0.734     0.188  
## 4     3 0.0867 0.26   1.86   3.45      0.299

Sum Diagonals

bb <- xxdicarlo_gs
#
# #Tibble Total SUM 
sum_bb <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
#
# #Convert to Bivariate Probability Distribution and Exclude First Character Column
ii <- bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>% select(-1)
#
# #(1A, 2B, 3C, 4D, 4E, 4F, 3G, 2H, 1I) 9 Unique Combinations = 24 (4x6) Experimental Outcomes 
matrix(data = LETTERS[row(ii) + col(ii)-1], nrow = 4)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "A"  "B"  "C"  "D"  "E"  "F" 
## [2,] "B"  "C"  "D"  "E"  "F"  "G" 
## [3,] "C"  "D"  "E"  "F"  "G"  "H" 
## [4,] "D"  "E"  "F"  "G"  "H"  "I"
# 
# #Using tapply(), sum the Matrix
jj <- tapply(X= as.matrix(ii), INDEX = LETTERS[row(ii) + col(ii)-1], FUN = sum)
print(jj)
##          A          B          C          D          E          F          G          H          I 
## 0.07000000 0.17000000 0.23000000 0.29000000 0.12666667 0.06666667 0.02333333 0.02333333 0.00000000
# #In place of LETTERS, Numerical Index can also be used but Letters are more clear for grouping
#tapply(X= as.matrix(ii), INDEX = c(0:8)[row(ii) + col(ii)-1], FUN = sum)
#
# #Create Tibble
kk <- tibble(Fs = jj, ID = LETTERS[1:length(Fs)], s = 1:length(Fs) - 1) %>% 
  relocate(Fs, .after = last_col())
print(kk)
## # A tibble: 9 x 3
##   ID        s     Fs
##   <chr> <dbl>  <dbl>
## 1 A         0 0.07  
## 2 B         1 0.17  
## 3 C         2 0.23  
## 4 D         3 0.29  
## 5 E         4 0.127 
## 6 F         5 0.0667
## 7 G         6 0.0233
## 8 H         7 0.0233
## 9 I         8 0

String Split

bb <- xxdicarlo_gs
# #Separate String based on Position 
bb %>% separate(col = Geneva_Saratoga, into = c("A", "B"), sep = 1) 
## # A tibble: 4 x 8
##   A     B        y0    y1    y2    y3    y4    y5
##   <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 x     0        21    30    24     9     2     0
## 2 x     1        21    36    33    18     2     1
## 3 x     2         9    42     9    12     3     2
## 4 x     3         3     9     6     3     5     0

27.9 Covariance

  • Covariance of random variables x and y is given by \(\sigma_{xy}\), Refer equation (27.1)
    • NOTE: It does not look like (25.15), but for now I am assuming the two are equivalent
    • Calculated: \(\text{Var}(s) = \text{Var}(x + y) =\) 2.389; \(\text{Var}(y) =\) 1.25; \(\text{Var}(x) =\) 0.869
    • Covariance \(\sigma_{xy} = \frac{2.3895 - 0.8696 - 1.25}{2} = 0.1350\)
    • A covariance of .1350 indicates that daily sales at the two dealerships have a positive relationship.

\[\sigma_{xy} = \frac{\text{Var}(x + y) - \text{Var}(x) - \text{Var}(y)}{2} \tag{27.1}\]

  • Correlation of random variables x and y is given by, Refer equation (25.16), \(\rho_{xy} = \frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}}\)
    • Where \(\sigma_{x} = \sqrt{\text{Var}(x)} = \sqrt{0.8696} = 0.9325\); and \(\sigma_{y} = \sqrt{\text{Var}(y)} = \sqrt{1.25} = 1.1180\)
    • Correlation Coefficient \(\rho_{xy} = \frac{0.1350}{0.9325 \times 1.1180} = 0.1295\)
    • The correlation coefficient of .1295 indicates there is a weak positive relationship between the random variables representing daily sales at the two dealerships.
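A quick base-R check of equation (27.1) and the correlation coefficient, using the variances computed above:

```r
# #Variances from this section: Var(x + y), Var(x), Var(y)
var_s <- 2.3895
var_x <- 0.8696
var_y <- 1.25
cov_xy <- (var_s - var_x - var_y) / 2           # #equation (27.1)
cor_xy <- cov_xy / (sqrt(var_x) * sqrt(var_y))  # #correlation coefficient
c(cov = cov_xy, cor = cor_xy)
```

Small differences from the text's .1350 and .1295 come from rounding the input variances.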

27.10 Distributions

  • “ForLater”
    • Binomial Probability Distribution - dbinom(), pbinom(), qbinom(), rbinom()
      • It can be used to determine the probability of obtaining \({x}\) successes in \({n}\) trials.
      • 4 Assumptions must be TRUE
        1. The experiment consists of a sequence of \({n}\) identical trials.
        2. Two outcomes are possible on each trial, one called success and the other failure.
        3. The probability of a success \({p}\) does not change from trial to trial. Consequently, the probability of failure, \(1 − p\), does not change from trial to trial.
        4. The trials are independent.
    • Poisson Probability Distribution - dpois(), ppois(), qpois(), rpois()
      • To determine the probability of obtaining \({x}\) occurrences over an interval of time or space.
      • 2 Assumptions must be TRUE
        1. The probability of an occurrence of the event is the same for any two intervals of equal length.
        2. The occurrence or nonoccurrence of the event in any interval is independent of the occurrence or nonoccurrence of the event in any other interval.
    • Hypergeometric Probability Distribution
      • Like the binomial, it is used to compute the probability of \({x}\) successes in \({n}\) trials.
      • But, in contrast to the binomial, the probability of success changes from trial to trial.
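A minimal sketch of the corresponding d/p/q functions; all parameter values below are illustrative, not taken from the text:

```r
# #Binomial: probability of exactly 2 successes in 10 trials with p = 0.3
dbinom(2, size = 10, prob = 0.3)
# #Binomial: cumulative probability of at most 2 successes
pbinom(2, size = 10, prob = 0.3)
# #Poisson: probability of exactly 4 occurrences when the mean rate is 3 per interval
dpois(4, lambda = 3)
# #Poisson: 95th percentile of the occurrence count
qpois(0.95, lambda = 3)
# #Hypergeometric: 2 successes when drawing 5 from 10 successes and 20 failures
dhyper(2, m = 10, n = 20, k = 5)
```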

Validation


28 Continuous Probability Distributions

28.1 Overview

28.2 Definitions (Ref)

27.2 A random variable that may assume either a finite number of values or an infinite sequence of values such as \(0, 1, 2, \dots\) is referred to as a discrete random variable. Numerically coded factor variables (e.g., Male as 0, Female as 1) are also discrete.

27.3 A random variable that may assume any numerical value in an interval or collection of intervals is called a continuous random variable. It is given by \(x \in [n, m]\). If the entire line segment between the two points also represents possible values for the random variable, then the random variable is continuous.

28.3 Uniform Probability Distribution

Definition 28.1 Uniform probability distribution is a continuous probability distribution for which the probability that the random variable will assume a value in any interval is the same for each interval of equal length. Whenever the probability is proportional to the length of the interval, the random variable is uniformly distributed.
Definition 28.2 The probability that the continuous random variable \({x}\) takes a value in the interval \([a, b]\) is given by the area under the graph of the probability density function \(f(x)\); that is, \(A = \int _{a}^{b}f(x)\ dx\). Note that \(f(x)\) can be greater than 1; however, its integral over the whole range must equal 1.
  • Basic Requirements (Similar to the Probability Basics & Discrete Probability)
    1. \(f(x) \geq 0\)
    2. \(A = \int _{-\infty}^{\infty}f(x)\ dx = 1\)
  • NOTE:
    • For a discrete random variable, the probability function \(f(x)\) provides the probability that the random variable assumes a particular value. With continuous random variables, the counterpart of the probability function is the probability density function \(f(x)\).
      • The difference is that the probability density function does not directly provide probabilities. However, the area under the graph of \(f(x)\) corresponding to a given interval does provide the probability that the continuous random variable \({x}\) assumes a value in that interval.
      • So when we compute probabilities for continuous random variables we are computing the probability that the random variable assumes any value in an interval (NOT at any particular point).
      • Because the area under the graph of \(f(x)\) at any particular point is zero, the probability of any particular value of the random variable is zero.
      • It also means that the probability of a continuous random variable assuming a value in any interval is the same whether or not the endpoints are included.
    • Expected Value and Variance are given by (28.1)

\[\begin{align} E(x) &= \frac{a + b}{2} \\ \text{Var}(x) &= \frac{(b - a)^2}{12} \end{align} \tag{28.1}\]
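A base-R sketch of these properties; the interval \([120, 140]\) is an illustrative choice:

```r
# #Uniform distribution on [a, b]
a <- 120
b <- 140
dunif(130, min = a, max = b)             # #constant density 1/(b - a)
## [1] 0.05
punif(135, a, b) - punif(125, a, b)      # #probability proportional to interval length
## [1] 0.5
c(Ex = (a + b) / 2, Var = (b - a)^2 / 12)  # #equation (28.1)
```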

28.4 Normal Probability Distribution

Definition 28.3 A normal distribution (\({\mathcal{N}}_{({\mu}, \, {\sigma}^2)}\)) is a type of continuous probability distribution for a real-valued random variable.
  • The general form of its probability density function is given by equation (28.2)
    • Normal distribution \({\mathcal{N}}_{({\mu}, \, {\sigma}^2)}\) is also known as the Gaussian or Laplace-Gauss distribution
    • It is symmetrical
    • The entire family of normal distributions is differentiated by two parameters: the mean \({\mu}\) and the standard deviation \({\sigma}\). They determine the location and shape of the normal distribution.
    • The highest point on the normal curve is at the mean, which is also the median and mode of the distribution.
    • The normal distribution is symmetric around its mean. Its skewness measure is zero.
    • The tails of the normal curve extend to infinity in both directions and theoretically never touch the horizontal axis.
    • Larger values of the standard deviation result in wider, flatter curves, showing more variability in the data.
    • Probabilities for the normal random variable are given by areas under the normal curve. The total area under the curve for the normal distribution is 1.
    • Values of a normal random variable fall within: \(68.27\% \, ({\mu} \pm {\sigma}), 95.45\% \, ({\mu} \pm 2{\sigma}), 99.73\% \, ({\mu} \pm 3{\sigma})\). This is the basis of the Empirical Rule

\[f(x) = {\frac {1}{{\sigma}{\sqrt {2 \pi}}}} e^{-{\frac {1}{2}}\left( {\frac {x-{\mu} }{\sigma}}\right) ^{2}} \tag{28.2}\]
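The Empirical Rule areas listed above can be verified with pnorm() for the standard normal:

```r
# #Area within mu +/- k sigma for the standard normal N(0, 1)
round(pnorm(1) - pnorm(-1), 4)  # #one standard deviation
## [1] 0.6827
round(pnorm(2) - pnorm(-2), 4)  # #two standard deviations
## [1] 0.9545
round(pnorm(3) - pnorm(-3), 4)  # #three standard deviations
## [1] 0.9973
# #Peak density at the mean matches equation (28.2) with mu = 0, sigma = 1
all.equal(dnorm(0), 1 / sqrt(2 * pi))
## [1] TRUE
```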


Figure 28.1 (C06P01) Normal Distribution

Histogram

# #Histogram with Density Curve, Mean and Median: Normal Distribution
ee <- f_getRDS(xxNormal)
hh <- tibble(ee)
ee <- NULL
# #Basics
median_hh <- round(median(hh[[1]]), 3)
mean_hh <- round(mean(hh[[1]]), 3)
sd_hh <- round(sd(hh[[1]]), 3)
len_hh <- nrow(hh)
#
# #Base Plot: Creates Only Density Function Line
ii <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + geom_density() }
#
# #Change the line colour and alpha
ii <- ii + geom_density(alpha = 0.2, colour = "#21908CFF") 
#
# #Add Histogram with 50 bins, alpha and fill
ii <- ii + geom_histogram(aes(y = ..density..), bins = 50, alpha = 0.4, fill = '#FDE725FF')
#
# #Full Vertical Line at Mean. Goes across Function Boundary on Y-Axis
#ii <- ii + geom_vline(aes(xintercept = mean_hh), color = '#440154FF')
#
# #Shaded Area Object for Line/Area up to the Function Boundary on Y-Axis
# #Mean
ii_mean <- ggplot_build(ii)$data[[1]] %>% filter(x <= mean_hh)  
# #Median
ii_median <- ggplot_build(ii)$data[[1]] %>% filter(x <= median_hh)
#
# #To show values which are less than Mean in colour
#ii <- ii + geom_area(data = ii_mean, aes(x = x, y = y), fill = 'blue', alpha = 0.5) 
#
# #Line up to the Density Curve at Mean
ii <- ii + geom_segment(data = ii_mean, 
             aes(x = mean_hh, y = 0, xend = mean_hh, yend = density), color = "#440154FF")
#
# #Label 'Mean' 
ii <- ii + geom_text(aes(label = paste0("Mean= ", mean_hh), x = mean_hh, y = -Inf),
            color = '#440154FF', hjust = -0.5, vjust = -1, angle = 90, check_overlap = TRUE)
#
# #Similarly, Median Line and Label
ii <- ii + geom_segment(data = ii_median, 
             aes(x = median_hh, y = 0, xend = median_hh, yend = density), color = "#3B528BFF") +
  geom_text(aes(label = paste0("Median= ", median_hh), x = median_hh, y = -Inf), 
            color = '#3B528BFF', hjust = -0.4, vjust = 1.2, angle = 90, check_overlap = TRUE) 
#
# #Change Axis Limits
ii <- ii + coord_cartesian(xlim = c(-5, 5), ylim = c(0, 0.5))
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
# # Latex Labels 
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'), 
                TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'), 
                TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
ii <- ii + scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh)
#
# #Get Quantiles and Ranges of mean +/- sigma 
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3l_hh <- density_hh_tbl %>% filter(x <= mean_hh - 3 * sd_hh)
sig3r_hh <- density_hh_tbl %>% filter(x >= mean_hh + 3 * sd_hh)
sig2r_hh <- density_hh_tbl %>% filter(x >= mean_hh + 2 * sd_hh, x < mean_hh + 3 * sd_hh)
sig2l_hh <- density_hh_tbl %>% filter(x <= mean_hh - 2 * sd_hh, x > mean_hh - 3 * sd_hh)
sig1r_hh <- density_hh_tbl %>% filter(x >= mean_hh + sd_hh, x < mean_hh + 2 * sd_hh)
sig1l_hh <- density_hh_tbl %>% filter(x <= mean_hh - sd_hh, x > mean_hh - 2 * sd_hh)
#
# #Use (mean +/- 3 sigma) To Highlight. NOT ALL Zones have been highlighted
ii <- ii + geom_area(data = sig3l_hh, aes(x = x, y = y), fill = 'red') +
           geom_area(data = sig3r_hh, aes(x = x, y = y), fill = 'red')
#
# #Annotate Arrows 
ii <- ii + 
#  ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[3], y = 0.42, 
#                    yend = 0.42, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
#  ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[2], y = 0.45, 
#                    yend = 0.45, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
  ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[1], y = 0.48, 
                    yend = 0.48, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
#  ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[5], y = 0.42, 
#                    yend = 0.42, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
#  ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[6], y = 0.45, 
#                    yend = 0.45, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
  ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[7], y = 0.48, 
                    yend = 0.48, arrow = arrow(type = "closed", length = unit(0.02, "npc")))
#
# #Annotate Labels
ii <- ii + 
#  ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.42, label = "68.3%") +
#  ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.45, label = "95.4%") +
  ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.48, label = "99.7%")
#
# #Add a Theme and adjust Position of Title & Subtitle (Both by plot.title.position) & Caption
# #"plot" or "panel"
ii <- ii + theme(#plot.tag.position = "topleft",
                 #plot.caption.position = "plot", 
                 #plot.caption = element_text(hjust = 0),
                 plot.title.position = "panel")
#
# #Title, Subtitle, Caption, Axis Labels, Tag
ii <- ii + labs(x = "x", y = "Density", 
        subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh, 
                          "; Median= ", median_hh, "; SD= ", sd_hh, ")"), 
        caption = "C06AA", tag = NULL,
        title = "Normal Distribution (Symmetrical)")
#
#ii

Plot LaTex

# #Syntax 
#latex2exp::TeX(r'($\sigma =10$)', output = "character")
# #Test Equation
plot(TeX(r'(abc: $\frac{2hc^2}{\lambda^5} \, \frac{1}{e^{\frac{hc}{\lambda k_B T}} - 1}$)'), cex=2)
plot(TeX(r'(xyz: $f(x) =\frac{1}{\sigma \sqrt{2\pi}}\, e^{- \, \frac{1}{2} \,\left(\frac{x - \mu}{\sigma}\right)^2} $)'), cex=2)

Annotate Plot

# #Syntax
ggpp::annotate("text", x = -2, y = 0.3, label=TeX(r'($\sigma =10$)', output = "character"), parse = TRUE, check_overlap = TRUE)
# #NOTE: Complex Equations like the Normal Distribution can crash R.
ggpp::annotate("text", x = -2, y = 0.3, label=TeX(r'($f(x) =\frac{1}{\sigma \sqrt{2\pi}}\, e^{- \, \frac{1}{2} \, \left(\frac{x - \mu}{\sigma}\right)^2} $)', output = "character"), parse = TRUE, check_overlap = TRUE)

ggplot_build()

# #Data
bb <- f_getRDS(xxNormal)
hh <- tibble(bb)
# #Base Plot
ii <- hh %>% { ggplot(data = ., mapping = aes(x = bb)) + geom_density() }
# #Attributes 
attributes(ggplot_build(ii))$names
## [1] "data"   "layout" "plot"
#
str(ggplot_build(ii)$data[[1]])
## 'data.frame':    512 obs. of  18 variables:
##  $ y          : num  0.000504 0.00052 0.000532 0.000541 0.000545 ...
##  $ x          : num  -3.63 -3.61 -3.6 -3.58 -3.57 ...
##  $ density    : num  0.000504 0.00052 0.000532 0.000541 0.000545 ...
##  $ scaled     : num  0.00126 0.0013 0.00133 0.00136 0.00137 ...
##  $ ndensity   : num  0.00126 0.0013 0.00133 0.00136 0.00137 ...
##  $ count      : num  5.04 5.2 5.32 5.41 5.45 ...
##  $ n          : int  10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 ...
##  $ flipped_aes: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ PANEL      : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ group      : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ ymin       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ymax       : num  0.000504 0.00052 0.000532 0.000541 0.000545 ...
##  $ fill       : logi  NA NA NA NA NA NA ...
##  $ weight     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ colour     : chr  "black" "black" "black" "black" ...
##  $ alpha      : logi  NA NA NA NA NA NA ...
##  $ size       : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ linetype   : num  1 1 1 1 1 1 1 1 1 1 ...

Errors

ERROR 28.1 Error in is.finite(x) : default method not implemented for type 'list'
  • For ggplot(), subsetting inside aes() is discouraged.
  • Assuming names(hh)[1] is "ee"
    • either use (x = "ee") : Use of hh[1] or .[1] will throw an error
    • or use (x = .data[["ee"]]) : Use of hh[[1]] or .[[1]] will work but will throw a warning.
      • Warning "Warning: Use of .[[1]] is discouraged. Use .data[[1]] instead."
      • Using .data[[1]] will throw a different error
ERROR 28.2 Error: Must subset the data pronoun with a string.
  • ggplot() | aes() | using .data[[1]] will throw this error
  • use .data[["ee"]] or "ee"
    • .data is a pronoun for an environment; it is for scope resolution, not a dataframe like the dot (.)

UNICODE

STOP! STOP! Just STOP using UNICODE for the R Console on WINDOWS (UTF-8 Issue).

28.5 Standard Normal

Definition 28.4 A random variable that has a normal distribution with a mean of zero \(({\mu} = 0)\) and a standard deviation of one \(({\sigma} = 1)\) is said to have a standard normal probability distribution. The z-distribution is given by \({\mathcal{z}}_{({\mu} = 0, \, {\sigma} = 1)}\)

\[f(z) = \varphi(z) = \frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}} \tag{28.3}\]

  • Refer equation (28.3)
    • Here, the factor \(1/{\sqrt{2\pi}}\) ensures that the total area under the curve \(\varphi(z)\) is equal to one.
    • The factor \(1/2\) in the exponent ensures that the distribution has unit variance, and therefore also unit standard deviation.
    • This function is symmetric around \(z = 0\), where it attains its maximum value \(1/{\sqrt{2\pi}}\), and has inflection points at \(z = +1\) and \(z = -1\).
    • While individual observations from normal distributions are referred to as \({x}\), they are referred to as \({z}\) in the z-distribution.
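These properties can be checked numerically; the sketch below hand-rolls the density as `phi` and cross-checks it against base R's dnorm():

```r
# #Standard Normal density (hand-rolled for illustration)
phi <- function(z) exp(-z^2 / 2) / sqrt(2 * pi)
# #Total area under the curve is 1
area_ii <- integrate(phi, lower = -Inf, upper = Inf)$value
stopifnot(abs(area_ii - 1) < 1e-6)
# #Maximum at z = 0 equals 1/sqrt(2*pi); the curve is symmetric
stopifnot(abs(phi(0) - 1 / sqrt(2 * pi)) < 1e-12)
stopifnot(phi(1.5) == phi(-1.5))
# #Matches base R's built-in density
stopifnot(abs(phi(1) - dnorm(1)) < 1e-12)
```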

25.20 The z-score, \({z_i}\), can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). It is associated with each \({x_i}\). The z-score is often called the standardized value or standard score.
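A minimal sketch (the data vector is assumed) computing z-scores by hand and via base R's scale():

```r
# #Assumed sample data
x_ii <- c(2, 4, 4, 4, 5, 5, 7, 9)
# #z-score: number of standard deviations each x_i is from the mean
z_ii <- (x_ii - mean(x_ii)) / sd(x_ii)
# #Same via scale(), which centers by the mean and scales by the sample SD
z_jj <- as.numeric(scale(x_ii))
stopifnot(isTRUE(all.equal(z_ii, z_jj)))
# #Standardized values have mean 0 and SD 1
round(c(mean(z_ii), sd(z_ii)), 10)
```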

  • NOTE (R, C) notation denotes Row x Column of a Table
    • Because the standard normal random variable is continuous, \(P(z \leq 1.00) = P(z < 1.00)\)
    • The cumulative probability corresponding to \(z = 1.00\) is the table value located at the intersection of the row labeled \({1.0}\) and the column labeled \({.00}\) i.e. \(P_{\left(z\leq 1.00\right)} = P_{\left(1.0, \, .00\right)} = 0.8413\)
    • To compute the probability that \({z}\) is in the interval between −.50 and 1.25
      • \(P_{\left(-0.50 \leq z\leq 1.25\right)} = P_{\left(z\leq 1.25\right)} - P_{\left(z\leq -0.50 \right)} = P_{\left(1.2, \, .05\right)} - P_{\left(-0.50, \, .00\right)} = 0.8944 - 0.3085 = 0.5859\)
    • To compute the probability of obtaining a z value of at least 1.58
      • \(P_{\left(z\geq 1.58\right)} = 1 - P_{\left(z\leq 1.58\right)} = 1 - P_{\left(1.5, \, .08\right)} = 1 - 0.9429 = 0.0571\)
    • To compute the probability that the standard normal random variable is within one standard deviation of the mean
      • \(P_{\left(-1.00 \leq z\leq 1.00\right)} = P_{\left(z\leq 1.00\right)} - P_{\left(z\leq -1.00 \right)} = P_{\left(1.0, \, .00\right)} - P_{\left(-1.0, \, .00\right)} = 0.8413 - 0.1587 = 0.6826\)
  • Reverse i.e. given the probability, find out the z-value
    • Find a z value such that the probability of obtaining a larger z value is .10
      • The standard normal probability table gives the area under the curve to the left of a particular z value, which would be \(P_{\left(z\right)} = 1 - 0.10 = 0.9000 \approx P_{\left(1.2, \, .08\right)} \to z = 1.28\)

\({z} \in \mathbb{R} \iff P_{(z)} \in (0, 1)\)
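This correspondence can be verified directly: pnorm() maps any real z to a probability in (0, 1), and qnorm() inverts the mapping.

```r
# #pnorm() maps z to P in (0, 1); qnorm() maps P back to z
z_ii <- c(-3, -0.5, 0, 1.28, 3)
p_ii <- pnorm(z_ii)
stopifnot(all(p_ii > 0 & p_ii < 1))
# #Round trip recovers the original z values
stopifnot(max(abs(qnorm(p_ii) - z_ii)) < 1e-6)
```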

Cal P

# #Find Cumulative Probability P corresponding to the given 'z' value
# #Area under the curve to the left of z-value = 1.00
pnorm(q = 1.00)
## [1] 0.8413447

pnorm()

# #Find Cumulative Probability P corresponding to the given 'z' value
# #Area under the curve to the left of z-value = 1.00
# #pnorm(q = 1.00) #(Default) 'lower.tail = TRUE'
z_ii <- 1.00 
p_ii <- round(pnorm(q = z_ii, lower.tail = TRUE), 4)
cat(paste0("P(z <= ", format(z_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(z <= 1.000) = 0.8413
#
# #Probability that z is in the interval between −.50 and 1.25 #0.5859
z_min_ii <- -0.50
z_max_ii <- 1.25
p_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ", 
           format(z_max_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(-0.500 <= z <= 1.250) = 0.5858
#
# #Probability of obtaining a z value of at least 1.58 #0.0571
z_ii <- 1.58
p_ii <- round(pnorm(q = z_ii, lower.tail = FALSE), 4)
cat(paste0("P(z >= ", format(z_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(z >= 1.580) = 0.0571
#
# #Probability that the z is within one standard deviation of the mean i.e. [-1, 1] #0.6826
z_min_ii <- -1.00
z_max_ii <- 1.00
p_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ", 
           format(z_max_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(-1.000 <= z <= 1.000) = 0.6827

Cal Z

# #Find a z value such that the probability of obtaining a larger z value is .10
# #z-value for which Area under the curve towards Right is 0.10
qnorm(p = 1 - 0.10)
## [1] 1.281552
qnorm(p = 0.10, lower.tail = FALSE)
## [1] 1.281552

qnorm()

# #Find a z value such that the probability of obtaining a larger z value is .10
# #z-value for which Area under the curve towards the Right is 0.10 i.e. right tail = 10%
#qnorm(p = 1 - 0.10)
#qnorm(p = 0.10, lower.tail = FALSE)
p_r_ii <- 0.10 
p_l_ii <- 1 - p_r_ii
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
z_jj <- round(qnorm(p = p_r_ii, lower.tail = FALSE), 4)
stopifnot(identical(z_ii, z_jj))
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ", 
           format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.900 (i.e. (Right) 1-P(z) = 0.100) at z = 1.2816

28.6 Any Normal

  • Any normal distribution can be standardized by converting the individual values into z-scores.
    • z-scores tell how many standard deviations away from the mean each value lies.
  • Probabilities for all normal distributions are computed by using the standard normal distribution.
    • A normal distribution \({\mathcal{N}}_{({\mu}, \, {\sigma})}\) is converted to the standard normal distribution \({\mathcal{z}}_{({\mu} = 0, \, {\sigma} = 1)}\) by equation (28.4) (Similar to equation (25.14))
    • If \({x}\) is a random variable from this population, then its z-score is \(Z = \frac {X - {\mu}}{\sigma}\)
    • If \(\overline{X}\) is the mean of a sample of size \({n}\) from this population, then the standard error is \({\sigma}/{\sqrt{n}}\) and thus the z-score is \(Z = \frac{\overline{X} - {\mu}}{{\sigma}/{\sqrt{n}}}\)
    • If \(\sum {X}\) is the total of a sample of size \({n}\) from this population, then the expected total is \(n\times{\mu}\) and the standard error is \({\sigma}{\sqrt{n}}\). Thus the z-score is \(Z = {\frac{\sum{X}-n{\mu}}{{\sigma}{\sqrt{n}}}}\)
  • Thus
    • \(x = {\mu} \to z = 0\) i.e. A value of \({x}\) equal to its mean \({\mu}\) corresponds to \(z = 0\).
    • \(x = {\mu} + {\sigma} \to z = 1\) i.e. an \({x}\) value that is one standard deviation above its mean \(({\mu} + {\sigma})\) corresponds to \(z = 1\).
      • Thus, we can interpret \({z}\) as the number of standard deviations \(({\sigma})\) that the normal random variable \({x}\) is from its mean \(({\mu})\).
    • For a normal distribution \({\mathcal{N}}_{({\mu} = 10, \, {\sigma} = 2)}\), What is the probability that the random variable x is between 10 and 14
      • At x = 10, z = 0 and at x = 14, z = 2, Thus
      • \(P_{\left(0 \leq z\leq 2\right)} = P_{\left(z\leq 2\right)} - P_{\left(z\leq 0 \right)} = P_{\left(2.0, \, .00\right)} - P_{\left(0, \, .00\right)} = 0.9772 - 0.5000 = 0.4772\)
    • Grear Tire Company Problem
      • For a new tire product, the mileage follows a normal distribution \({\mathcal{N}}_{({\mu} = 36500, \, {\sigma} = 5000)}\).
      • What percentage of the tires can be expected to last more than 40,000 miles, i.e., what is the probability that the tire mileage, x, will exceed 40,000
        • Solution: 24.2%
      • Let us now assume that Grear is considering a guarantee that will provide a discount on replacement tires if the original tires do not provide the guaranteed mileage. What should the guarantee mileage be if Grear wants no more than 10% of the tires to be eligible for the discount guarantee
        • Solution: \(30092 \approx 30100 \text{ miles}\)

\[z = \frac{x - {\mu}}{{\sigma}} \tag{28.4}\]

Reasons to convert normal distributions into the standard normal distribution:

  • To find the probability of observations in a distribution falling above or below a given value
  • To find the probability that a sample mean significantly differs from a known population mean
  • To compare scores on different distributions with different means and standard deviations

Each z-score is associated with a probability, or p-value, that gives the likelihood of values below that z-score occurring. By converting an individual value into a z-score, we can find the probability of all values up to that value occurring in a normal distribution.

The z-score is the test statistic used in a z-test. The z-test is used to compare the means of two groups, or to compare the mean of a group to a set value. Its null hypothesis typically assumes no difference between groups.

The area under the curve to the right of a z-score is the p-value, and it is the likelihood of your observation occurring if the null hypothesis is true.

Usually, a p-value of 0.05 or less means that your results are unlikely to have arisen by chance; it indicates a statistically significant effect.
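As a sketch of that z-test logic (the data vector, hypothesized mean, and known sigma below are assumed for illustration):

```r
# #One-sample z-test of H0: mu = 100 against H1: mu > 100, with known sigma
x_ii <- c(105, 98, 112, 107, 101, 109, 96, 104, 110, 103)
mu0_ii <- 100
sigma_ii <- 15
z_ii <- (mean(x_ii) - mu0_ii) / (sigma_ii / sqrt(length(x_ii)))
# #p-value: area under the curve to the right of z
p_ii <- pnorm(z_ii, lower.tail = FALSE)
cat(paste0("z = ", round(z_ii, 4), "; p-value = ", round(p_ii, 4), "\n"))
```

Here the p-value exceeds 0.05, so this (assumed) sample provides no evidence against the null hypothesis.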

Cal P

# #For N(mu =10, sd =2) Probability that X is in [10, 14]
# #Same as P(0 <= z <= 2)
mu_ii <- 10
sd_ii <- 2
x_min_ii <- 10
x_max_ii <- 14
#
z_min_ii <- (x_min_ii - mu_ii) /sd_ii #0
z_max_ii <- (x_max_ii - mu_ii) /sd_ii #2
#
pz_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
# #OR
px_ii <- round(pnorm(q = x_max_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE) - 
                  pnorm(q = x_min_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE), 4)
stopifnot(identical(pz_ii, px_ii))
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ", 
           format(z_max_ii, nsmall = 3), ") = ", pz_ii, "\n"))
## P(0.000 <= z <= 2.000) = 0.4772
cat(paste0("P(", x_min_ii, " <= x <= ", x_max_ii, ") = ", format(px_ii, nsmall = 3), "\n"))
## P(10 <= x <= 14) = 0.4772

Grear Tire

# #Grear Tire N(mu = 36500, sd =5000)
# #Probability that the tire mileage, x, will exceed 40000 # 24.2% Tires
mu_ii <- 36500
sd_ii <- 5000
x_ii <- 40000
#
z_ii <- (x_ii - mu_ii)/sd_ii
#
#pnorm(q = 40000, mean = 36500, sd = 5000, lower.tail = FALSE)
pz_ii <- round(pnorm(q = z_ii, lower.tail = FALSE), 4)
px_ii <- round(pnorm(q = x_ii, mean = mu_ii, sd = sd_ii, lower.tail = FALSE), 4)
stopifnot(identical(px_ii, pz_ii))
#
cat(paste0("P(x >= ", x_ii, ") = ", format(px_ii, nsmall = 4), " (", 
           round(100* px_ii, 2), "%)\n"))
## P(x >= 40000) = 0.2420 (24.2%)
#
# #What should the guarantee mileage be if no more than 10% of the tires to be eligible 
# #for the discount guarantee i.e. left tail = 10% # ~30100 miles
p_l_ii <- 0.10
p_r_ii <- 1 - p_l_ii
#
#qnorm(p = 0.10, mean = 36500, sd = 5000)
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
xz_ii <- z_ii * sd_ii + mu_ii
#
x_ii <- round(qnorm(p = p_l_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE), 4)
stopifnot(abs(xz_ii - x_ii) < 1)
cat(paste0("(Left) P(x) = ", p_l_ii, " (i.e. (Right) 1-P(z) = ", p_r_ii, 
           ") at x = ", round(x_ii, 1), "\n"))
## (Left) P(x) = 0.1 (i.e. (Right) 1-P(z) = 0.9) at x = 30092.2

Exercises

  • “ForLater”
    • Exercises
    • Normal Approximation of Binomial Probabilities
    • Exponential Probability Distribution
    • Relationship Between the Poisson and Exponential Distributions

Validation


29 Sampling Distributions

29.1 Overview

29.2 Definitions (Ref)

23.2 Elements are the entities on which data are collected. (Generally ROWS)

23.3 A variable is a characteristic of interest for the elements. (Generally COLUMNS)

23.20 A population is the set of all elements of interest in a particular study.

23.21 A sample is a subset of the population.

23.22 The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.

29.3 Sample

The sample contains only a portion of the population. Some sampling error is to be expected. So, the sample results provide only estimates of the values of the corresponding population characteristics.

Definition 29.1 The sampled population is the population from which the sample is drawn.
Definition 29.2 Frame is a list of the elements that the sample will be selected from.
Definition 29.3 The target population is the population we want to make inferences about. Generally (and preferably), it will be the same as the ‘Sampled Population,’ but it may also differ.
Definition 29.4 A simple random sample (SRS) is a set of \({k}\) objects in a population of \({N}\) objects where all possible samples are equally likely to happen. The number of such different simple random samples is \(C_k^N\)
Definition 29.5 Sampling without replacement: Once an element has been included in the sample, it is removed from the population and cannot be selected a second time.
Definition 29.6 Sampling with replacement: Once an element has been included in the sample, it is returned to the population. A previously selected element can be selected again and therefore may appear in the sample more than once.
  • Infinite Population
    • Sometimes the population is infinitely large or the elements of the population are being generated by an ongoing process for which there is no limit on the number of elements that can be generated.
    • Thus, it is not possible to develop a list of all the elements in the population. This is considered the infinite population case.
    • With an infinite population, we cannot select a ‘simple random sample’ because we cannot construct a frame consisting of all the elements.
    • In the infinite population case, statisticians recommend selecting what is called a ‘random sample.’
Definition 29.7 A random sample of size \({n}\) from an infinite population is a sample selected such that the following two conditions are satisfied. Each element selected comes from the same population. Each element is selected independently. The second condition prevents selection bias.
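In base R, sample() draws both kinds of samples; the population and sample sizes below are assumed for illustration:

```r
# #Population of N = 100 element IDs
set.seed(7)
pop_ii <- 1:100
# #Sampling WITHOUT replacement: no element can be selected twice
srs_ii <- sample(pop_ii, size = 5, replace = FALSE)
# #Sampling WITH replacement: an element may appear more than once
swr_ii <- sample(pop_ii, size = 5, replace = TRUE)
# #Number of possible simple random samples C(N, k)
choose(100, 5)
## [1] 75287520
```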

Random sample vs. SRS

  • Random sample: every element of the population has a (nonzero) probability of being drawn.
    • each element does not necessarily have an equal chance of being chosen.
  • SRS: every element of the population has the same (nonzero) probability of being drawn.
    • SRS is thus a special case of a random sample.
    • SRS is a subset of a statistical population in which each member of the subset has an equal probability of being chosen.
  • Elaboration of the Two conditions for Random Sample
    • Example: Consider a production line designed to fill boxes of a breakfast cereal.
      • Each element selected comes from the same population.
        • To ensure this, the boxes must be selected at approximately the same point in time.
        • This way the inspector avoids the possibility of selecting some boxes when the process is operating properly and other boxes when the process is not operating properly.
      • Each element is selected independently.
        • It is satisfied by designing the production process so that each box of cereal is filled independently.
    • Example: Consider the population of customers arriving at a fast-food restaurant.
      • McDonald's implemented a random sampling procedure for this situation.
      • The sampling procedure was based on the fact that some customers presented discount coupons.
      • Whenever a customer presented a discount coupon, the next customer served was asked to complete a customer profile questionnaire. Because arriving customers presented discount coupons randomly and independently of other customers, this sampling procedure ensured that customers were selected independently.

29.4 Point Estimation

Definition 29.8 A population proportion \({P}\) is a parameter that describes a percentage value associated with a population. It is given by \(P = \frac{X}{N}\), where \({X}\) is the count of successes in the population, and \({N}\) is the size of the population. It is estimated through the sample proportion \(\overline{p} = \frac{x}{n}\), where \({x}\) is the count of successes in the sample, and \({n}\) is the size of the sample obtained from the population.
Definition 29.9 To estimate the value of a population parameter, we compute a corresponding characteristic of the sample, referred to as a sample statistic. This process is called point estimation.
Definition 29.10 A sample statistic is the point estimator of the corresponding population parameter. For example, the sample statistics \(\overline{x}, s, s^2, s_{xy}, r_{xy}\) are point estimators for the corresponding population parameters \({\mu}\) (mean), \({\sigma}\) (standard deviation), \(\sigma^2\) (variance), \(\sigma_{xy}\) (covariance), \(\rho_{xy}\) (correlation)
Definition 29.11 The numerical value obtained for the sample statistic is called the point estimate. The term estimate is used for a sample value only; the corresponding population value is called a parameter. An estimate is a value, while an estimator is a function.
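A short sketch of point estimation (the simulated data and population parameters are assumed):

```r
# #Simulated sample from an assumed population N(mu = 20, sigma = 4)
set.seed(3)
x_ii <- rnorm(50, mean = 20, sd = 4)
# #Point estimates (sample statistics) for mu, sigma, sigma^2
c(xbar = mean(x_ii), s = sd(x_ii), s2 = var(x_ii))
# #Sample proportion: e.g. 19 successes in a sample of n = 30
p_bar_ii <- 19 / 30
round(p_bar_ii, 2)
## [1] 0.63
```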

29.5 Sampling Distributions

Definition 29.12 The sampling distribution of \({\overline{x}}\) is the probability distribution of all possible values of the sample mean \({\overline{x}}\).

Suppose, from a population, we take a sample of size \({n}\) and calculate the point estimate mean \(\overline{x}_{1}\). Further, we can select another random sample from the population and get another point estimate mean \(\overline{x}_{2}\). If we repeat this process 500 times, we will have a frame of \(\{\overline{x}_{1}, \overline{x}_{2}, \ldots, \overline{x}_{500}\}\).

If we consider the process of selecting a simple random sample as an experiment, the sample mean \({\overline{x}}\) is the numerical description of the outcome of the experiment. Thus, the sample mean \({\overline{x}}\) is a random variable. As a result, just like other random variables, \({\overline{x}}\) has a mean or expected value, a standard deviation, and a probability distribution. Because the various possible values of \({\overline{x}}\) are the result of different simple random samples, the probability distribution of \({\overline{x}}\) is called the sampling distribution of \({\overline{x}}\). Knowledge of this sampling distribution and its properties will enable us to make probability statements about how close the sample mean \({\overline{x}}\) is to the population mean \({\mu}\).

Just as with other probability distributions, the sampling distribution of \({\overline{x}}\) has an expected value or mean, a standard deviation, and a characteristic shape or form.
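The repeated-sampling idea can be simulated directly; the population parameters below reuse the EAI values (mu = 51800, sigma = 4000) discussed later, and the normal population is an assumption of this sketch:

```r
# #500 simple random samples of size n = 30; one xbar per sample
set.seed(5)
xbar_ii <- replicate(500, mean(rnorm(30, mean = 51800, sd = 4000)))
# #Mean of the xbar values approximates mu; their SD approximates sigma/sqrt(n)
cat(paste0("E(xbar) ~ ", round(mean(xbar_ii), 1), 
           "; SE(xbar) ~ ", round(sd(xbar_ii), 1), 
           " (theory: ", round(4000 / sqrt(30), 1), ")\n"))
```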

29.5.1 Mean

  • Expected Value of \({\overline{x}}\)
    • The mean of the \({\overline{x}}\) random variable is the expected value of \({\overline{x}}\).
    • Let \(E(\overline{x})\) represent the expected value of \({\overline{x}}\) and \({\mu}\) represent the mean of the population from which we are selecting a simple random sample. Then, \(E(\overline{x}) = \mu\)
    • When the expected value of a point estimator equals the population parameter, we say the point estimator is unbiased. Thus, \({\overline{x}}\) is an unbiased estimator of the population mean \({\mu}\).

29.5.2 Standard Deviation

Definition 29.13 In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\). It is the indicator of ‘Sampling Fluctuation.’
  • Standard Deviation of \({\overline{x}}\), \(\sigma_{\overline{x}}\) is given by (29.1)
    • \(\sqrt{\frac{N - n}{N-1}}\) is commonly referred to as the finite population correction factor. With large population, it approaches 1
    • Thus, \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}}\) becomes good approximation when the sample size is less than or equal to 5% of the population size; that is, \(n/N \leq 0.05\).
    • To further emphasize the difference between \(\sigma_{\overline{x}}\) and \({\sigma}\), we refer to the standard deviation of \({\overline{x}}\), \(\sigma_{\overline{x}}\), as the standard error of the mean.
    • (Sampling Fluctuation) The standard error of the mean is helpful in determining how far the sample mean may be from the population mean.

\[\begin{align} \text{Finite Population: } \sigma_{\overline{x}} &= \sqrt{\frac{N - n}{N-1}}\left(\frac{\sigma}{\sqrt{n}} \right) \\ \text{Infinite Population: } \sigma_{\overline{x}} &= \frac{\sigma}{\sqrt{n}} \end{align} \tag{29.1}\]
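Equation (29.1) in R, using the EAI numbers (N = 2500, n = 30, sigma = 4000) for scale:

```r
# #Standard Error of the Mean, with and without the finite population correction
N_ii <- 2500
n_ii <- 30
sigma_ii <- 4000
se_inf_ii <- sigma_ii / sqrt(n_ii)
fpc_ii <- sqrt((N_ii - n_ii) / (N_ii - 1))
se_fin_ii <- fpc_ii * se_inf_ii
# #Here n/N = 0.012 <= 0.05, so the correction factor is close to 1
cat(paste0("SE (infinite) = ", round(se_inf_ii, 2), 
           "; fpc = ", round(fpc_ii, 4), 
           "; SE (finite) = ", round(se_fin_ii, 2), "\n"))
```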

Definition 29.14 A sampling error is the difference between a population parameter and a sample statistic.
  • Standard error is a measure of sampling error. There are others, but standard error is, by far, the most commonly used.
    • However, sampling error is NOT the only reason for a difference between the survey estimate and the true value in the population.
    • Another, and arguably more important, reason for this difference is bias.
      • Bias can be introduced when designing the sampling scheme.
      • Most forms of bias cannot be calculated nor measured after the data are collected, and are, therefore, often invisible.
      • Bias must be avoided by using correct procedures at each step of the survey process.
      • Bias has NOTHING to do with sample size which affects only sampling error and standard error.
      • As a result, large sample sizes do NOT eliminate bias. In fact, the larger sample size may increase the likelihood of bias in the data collection.

Refer Effect of Sample Size and Repeat Sampling


Figure 29.1 (B12A B12B) Effect of Sample Size vs Repeat Sampling

29.6 Synopsis

“ForLater”

If a statistically independent sample of \({n}\) observations \({{x}_1, {x}_2, \ldots, {x}_n}\) is taken from a statistical population with a standard deviation of \(\sigma\), then the mean value calculated from the sample \(\overline{x}\) will have an associated standard error of the mean \(\sigma_\overline{x}\) given by

\[\sigma_\overline{x} = \frac{\sigma}{\sqrt{n}} \tag{29.2}\]

The standard deviation \(\sigma\) of the population being sampled is seldom known. Therefore, \(\sigma_\overline{x}\) is usually estimated by replacing \(\sigma\) with the sample standard deviation \(\sigma_{x}\) instead:

\[\sigma_\overline{x} \approx \frac{\sigma_{x}}{\sqrt{n}} \tag{29.3}\]

As this is only an ‘estimator’ for the true “standard error,” other notations are used, such as:

\[\widehat{\sigma}_\overline{x} = \frac{\sigma_{x}}{\sqrt{n}} \tag{29.4}\]

OR:

\[{s}_\overline{x} = \frac{s}{\sqrt{n}} \tag{29.5}\]

Key:

  • \(\sigma\) : Standard deviation of the population
  • \(\sigma_{x}\) : Standard deviation of the sample
  • \(\sigma_\overline{x}\) : Standard deviation of the mean
    • the standard error
  • \(\widehat{\sigma}_\overline{x}\) : Estimator of the standard deviation of the mean
    • the most often calculated quantity
    • also often colloquially called the standard error

Non-mathematical view:

  • The SD (standard deviation) quantifies scatter — how much the values vary from one another.
  • The SEM (standard error of the mean) quantifies how precisely you know the true mean of the population.
    • It takes into account both the value of the SD and the sample size.
  • Both SD and SEM are in the same units i.e. the units of the data (in contrast, variance has squared units).
  • The SEM, by definition, is always smaller than the SD (it is the SD divided by \(\sqrt{n}\)).
    • The SEM gets smaller as your samples get larger.
    • So, the mean of a large sample is likely to be closer to the true population mean than is the mean of a small sample.
    • With a huge sample, you will know the value of the mean with a lot of precision even if the data is scattered.
  • The SD does not change predictably as you acquire more data.
    • The SD you compute from a sample is the best possible estimate of the SD of the overall population.
    • As you collect more data, you will assess the SD of the population with more precision. But you cannot predict whether the SD from a larger sample will be bigger or smaller than the SD from a small sample.
    • Technically, variance does not change predictably. Above is a simplification. For details, see Difference between SE and SD
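A small simulation makes the contrast concrete (the population SD = 10 and sample sizes are assumed): the SEM shrinks as n grows, while the SD keeps estimating the same population value.

```r
# #SEM shrinks with n; SD does not change predictably
set.seed(42)
for (n_ii in c(30, 300, 3000)) {
  x_ii <- rnorm(n_ii, mean = 50, sd = 10)
  cat(sprintf("n = %4d   SD = %5.2f   SEM = %5.3f\n", 
              n_ii, sd(x_ii), sd(x_ii) / sqrt(n_ii)))
}
```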

29.6.1 Form

Form of the Sampling Distribution of \({\overline{x}}\)

  • When the population has a normal distribution, the sampling distribution of \({\overline{x}}\) is normally distributed for any sample size.

  • When the population from which we are selecting a random sample does not have a normal distribution, the central limit theorem is helpful in identifying the shape of the sampling distribution of \({\overline{x}}\).

Definition 29.15 Central Limit Theorem: In selecting random samples of size \({n}\) from a population, the sampling distribution of the sample mean \({\overline{x}}\) can be approximated by a normal distribution as the sample size becomes large.

How large the sample size needs to be before the central limit theorem applies and we can assume that the shape of the sampling distribution is approximately normal

  • For most applications, the sampling distribution of \({\overline{x}}\) can be approximated by a normal distribution whenever the sample size is 30 or more.
  • In cases where the population is highly skewed or outliers are present, samples of size 50 may be needed.
  • Finally, if the population is discrete, the sample size needed for a normal approximation often depends on the population proportion.
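A central limit theorem sketch with a deliberately skewed (exponential) population; the 2000 repeat samples of size 30 are an assumption of this demo:

```r
# #Exponential(rate = 1) population: mu = 1, sigma = 1, heavily right-skewed
set.seed(1)
xbar_ii <- replicate(2000, mean(rexp(30, rate = 1)))
# #Sampling distribution of xbar is approximately N(mu, sigma/sqrt(n))
cat(paste0("mean(xbar) ~ ", round(mean(xbar_ii), 3), 
           "; sd(xbar) ~ ", round(sd(xbar_ii), 3), 
           " (theory: ", round(1 / sqrt(30), 3), ")\n"))
```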

Ex: EAI

Task of developing a profile of 2500 managers. The characteristics to be identified include the mean annual salary for the managers and the proportion of managers having completed a training.

  • Population
    • Population Size N = 2500 managers
    • Training: 1500/2500 managers have completed Training
    • Salary: \({\mathcal{N}}_{(\mu = 51800, \, \sigma = 4000)}\)
    • Proportion of the population that completed the training program \(p = \frac{1500}{2500} = 0.60\)
  • Suppose that a sample of 30 managers will be used, i.e. \({n=30}\), with 19 Yes responses for Training
    • Suppose the sample has \({\mathcal{N}}_{(\overline{x} = 51814, \, s = 3348)}\)
    • Also, \(\overline{p} = \frac{x}{n} = \frac{19}{30} = 0.63\)
  • If 500 such samples are taken, where each have their own \({\overline{x}}\)
    • Then their expected value \(E(\overline{x}) = \mu = 51800\)
    • Standard Error \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}} = \frac{4000}{\sqrt{30}} = 730.3\)
  • Suppose the director believes the sample mean \({\overline{x}}\) will be an acceptable estimate of the population mean \({\mu}\) if the sample mean is within 500 dollars of the population mean.
    • However, it is not possible to guarantee that the sample mean will be within 500 dollars of the population mean
    • We can reframe the request in probability terms i.e.
      • What is the probability that the sample mean computed using a simple random sample of 30 EAI managers will be within 500 dollars of the population mean
      • i.e. Probability that \(\overline{x} \in [51300, 52300]\)
        • For \(z = \frac{\overline{x} - \mu}{\sigma_{\overline{x}}}\)
        • For \(\overline{x} = 52300 \Rightarrow z = \frac{52300 - 51800}{730.30} = 0.68\)
        • For \(\overline{x} = 51300 \Rightarrow z = \frac{51300 - 51800}{730.30} = -0.68\)
    • \(P_{(51300 \leq \overline{x} \leq 52300)} = P_{(\overline{x} \leq 52300)} - P_{(\overline{x} \leq 51300)} = P_{(z \leq 0.68)} - P_{(z \leq -0.68)} = 0.7517 - 0.2483 = 0.5034\)
      • A simple random sample of 30 EAI managers has a 0.5034 probability of providing a sample mean \({\overline{x}}\) that is within 500 dollars of the population mean.
        • Thus, there is a \(1 − 0.5034 = 0.4966\) probability that the difference between \({\overline{x}}\) and \({\mu}\) will be more than 500 dollars.
        • In other words, a simple random sample of 30 EAI managers has roughly a 50-50 chance of providing a sample mean within the allowable 500 dollars. Perhaps a larger sample size should be considered.
        • Let us explore this possibility by considering the relationship between the sample size and the sampling distribution of \({\overline{x}}\).
  • Impact of \(n = 100\) in place of \(n =30\)
    • First note that \(E(\overline{x}) = \mu\) regardless of the sample size. Thus, the mean of all possible values of \({\overline{x}}\) is equal to the population mean \({\mu}\) regardless of the sample size \({n}\).
    • However, standard error is reduced to \(\sigma_{\overline{x}} = \frac{4000}{\sqrt{100}} = 400\)
    • For \(\overline{x} = 52300 \Rightarrow z = \frac{52300 - 51800}{400} = 1.25\)
    • For \(\overline{x} = 51300 \Rightarrow z = \frac{51300 - 51800}{400} = -1.25\)
    • Thus \(P_{(51300 \leq \overline{x} \leq 52300)} = P_{(\overline{x} \leq 52300)} - P_{(\overline{x} \leq 51300)} = P_{(z \leq 1.25)} - P_{(z \leq -1.25)} = 0.8944 - 0.1056 = 0.7888\)
    • Thus, by increasing the sample size from 30 to 100 EAI managers, we increase the probability of obtaining a sample mean within 500 dollars of the population mean from 0.5034 to 0.7888.
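
The two probability calculations above can be reproduced with base R's pnorm(); a minimal sketch (the small differences from 0.5034 and 0.7888 arise because the worked example rounds z to two decimals before using the table):

```r
# #EAI: P(sample mean within 500 dollars of mu) for n = 30 and n = 100
mu <- 51800
sigma <- 4000
for (n in c(30, 100)) {
  se <- sigma / sqrt(n)                      # standard error of x-bar
  p <- pnorm(52300, mean = mu, sd = se) - pnorm(51300, mean = mu, sd = se)
  cat("n =", n, "| SE =", round(se, 1),
      "| P(51300 <= x-bar <= 52300) =", round(p, 4), "\n")
}
```

Increasing \({n}\) shrinks the standard error, which is exactly why the same ±500 dollar band captures more of the sampling distribution.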

Caution: Here, we took advantage of the fact that the population mean \({\mu}\) and the population standard deviation \({\sigma}\) were known. However, usually these values will be unknown.

“ForLater”

Properties of Point Estimators

Three properties of good point estimators: unbiasedness, efficiency, and consistency.

\(\theta = \text{the population parameter of interest}\) \(\hat{\theta} = \text{the sample statistic or point estimator of } \theta\)

  • Unbiased
    • If the expected value of the sample statistic is equal to the population parameter being estimated, the sample statistic is said to be an unbiased estimator of the population parameter
  • Efficiency
    • When sampling from a normal population, the standard error of the sample mean is less than the standard error of the sample median. Thus, the sample mean is more efficient than the sample median.
  • Consistency
    • A point estimator is consistent if the values of the point estimator tend to become closer to the population parameter as the sample size becomes larger.

Other Sampling Methods

  • Stratified Random Sampling
  • Cluster Sampling
  • Systematic Sampling
  • Convenience Sampling
  • Judgment Sampling

Validation


30 Interval Estimation

30.1 Overview

30.2 Interval Estimate

Definition 30.1 Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error (MOE), to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{MOE}_{\gamma}\)
Definition 30.2 Confidence interval is another name for an interval estimate. The associated confidence level is normally given as \(({\gamma} = 1 - {\alpha})\). Ex: 95% confidence interval
Definition 30.3 The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.

Known SD

In order to develop an interval estimate of a population mean, either the population standard deviation \({\sigma}\) or the sample standard deviation \({s}\) must be used to compute the margin of error. In most applications \({\sigma}\) is not known, and \({s}\) is used to compute the margin of error.

In some applications, large amounts of relevant historical data are available and can be used to estimate the population standard deviation prior to sampling. Also, in quality control applications where a process is assumed to be operating correctly, or ‘in control,’ it is appropriate to treat the population standard deviation as known.

Sampling distribution of \({\overline{x}}\) can be used to compute the probability that \({\overline{x}}\) will be within a given distance of \({\mu}\).

Example: Lloyd Department Store

  • Each week Lloyd Department Store selects a simple random sample of 100 customers in order to learn about the amount spent per shopping trip.
    • With \({x}\) representing the amount spent per shopping trip, the sample mean \({\overline{x}}\) provides a point estimate of \({\mu}\), the mean amount spent per shopping trip for the population of all Lloyd customers. Based on the historical data, Lloyd now assumes a known value of \(\sigma = 20\) for the population standard deviation.
    • During the most recent week, Lloyd surveyed 100 customers \((n = 100)\) and obtained a sample mean of \(\overline{x} = 82\).
    • we can conclude that the sampling distribution of \({\overline{x}}\) follows a normal distribution with a standard error of \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}} = \frac{20}{\sqrt{100}} =2\).
    • Because the sampling distribution shows how values of \({\overline{x}}\) are distributed around the population mean \({\mu}\), the sampling distribution of \({\overline{x}}\) provides information about the possible differences between \({\overline{x}}\) and \({\mu}\).
    • Using the standard normal probability table, we find that 95% of the values of any normally distributed random variable are within \(\pm 1.96\) standard deviations of the mean i.e. \([\mu - 1.96 \sigma, \mu + 1.96\sigma]\).
      • Thus, 95% of the \({\overline{x}}\) values must be within \(\pm 1.96 \sigma_{\overline{x}}\) of the mean \({\mu}\).
      • In the Lloyd example we know that the sampling distribution of \({\overline{x}}\) is normally distributed with a standard error of \(\sigma_{\overline{x}} =2\).
      • we can conclude that 95% of all \({\overline{x}}\) values obtained using a sample size of \(n = 100\) will be within \((\pm 1.96 \times 2 = \pm 3.92)\) of the population mean \({\mu}\).
    • As given above, sample mean was \(\overline{x} = 82\)
      • Interval estimate of \(\overline{x} = 82 \pm 3.92 = [78.08, 85.92]\)
      • Because 95% of all the intervals constructed using \(\overline{x} = 82 \pm 3.92\) will contain the population mean, we say that we are 95% confident that the interval 78.08 to 85.92 includes the population mean \({\mu}\).
      • We say that this interval has been established at the 95% confidence level.
      • The value 0.95 is referred to as the confidence coefficient, and the interval 78.08 to 85.92 is called the 95% confidence interval.
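
The Lloyd interval can be computed directly in R; a minimal sketch using qnorm() to obtain the 1.96 multiplier:

```r
# #Lloyd: 95% interval estimate of mu with known sigma
xbar <- 82; sigma <- 20; n <- 100
moe <- qnorm(0.975) * sigma / sqrt(n)      # margin of error, about 3.92
c(lower = xbar - moe, upper = xbar + moe)  # about [78.08, 85.92]
```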

Interval Estimate of a Population Mean: \({\sigma}\) known is given by equation (30.1)

\[\begin{align} \overline{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \end{align} \tag{30.1}\]

where \((1 − \alpha)\) is the confidence coefficient and \(z_{\alpha/2}\) is the z-value providing an area of \(\alpha/2\) in the upper tail of the standard normal probability distribution.

For a 95% confidence interval, the confidence coefficient is \((1 − \alpha) = 0.95\) and thus, \(\alpha = 0.05\). Using the standard normal probability table, an area of \(\alpha/2 = 0.05/2 = 0.025\) in the upper tail provides \(z_{.025} = 1.96\).

# #Find z-value for confidence interval 95% i.e. (1-alpha) = 0.95 i.e. alpha = 0.05
# #To look for Area under the curve towards Right only i.e. alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ", 
           format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.975 (i.e. (Right) 1-P(z) = 0.025) at z = 1.96
#
# #Critical Value (z) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
#
# #Left Tail Test
round(qnorm(p = xxalpha, lower.tail = TRUE), 4)
##     10%      5%    5/2%      1%    1/2% 
## -1.2816 -1.6449 -1.9600 -2.3263 -2.5758
#
# #Right Tail Test
round(qnorm(p = xxalpha, lower.tail = FALSE), 4)
##    10%     5%   5/2%     1%   1/2% 
## 1.2816 1.6449 1.9600 2.3263 2.5758

30.3 Unknown SD

Definition 30.4 When \({s}\) is used to estimate \({\sigma}\), the margin of error and the interval estimate for the population mean are based on a probability distribution known as the t distribution.

The t distribution is a family of similar probability distributions, with a specific t distribution depending on a parameter known as the degrees of freedom. As the number of degrees of freedom increases, the difference between the t distribution and the standard normal distribution becomes smaller and smaller.

Just as \(z_{0.025}\) was used to indicate the z value providing a 0.025 area in the upper tail of a standard normal distribution, \(t_{0.025}\) indicates a 0.025 area in the upper tail of a t distribution. In general, the notation \(t_{\alpha/2}\) represents a t value with an area of \(\alpha/2\) in the upper tail of the t distribution.

As the degrees of freedom increase, the t distribution approaches the standard normal distribution. Ex: \(t_{0.025} = 2.262 \, (\text{DOF} = 9)\), \(t_{0.025} = 2.000 \, (\text{DOF} = 60)\), and \(t_{0.025} = 1.960 \, (\text{DOF} = \infty) = z_{0.025}\)

Interval Estimate of a Population Mean: \({\sigma}\) Unknown is given by equation (30.2)

\[\begin{align} \overline{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} \end{align} \tag{30.2}\]

where \({s}\) is the sample standard deviation, \((1 − \alpha)\) is the confidence coefficient and \(t_{\alpha/2}\) is the t-value providing an area of \(\alpha/2\) in the upper tail of the t distribution with \({n-1}\) degrees of freedom.

Refer equation (25.12), the expression for the sample standard deviation is

\[{s} = \sqrt{\frac{\sum \left(x_i - \overline{x}\right)^2}{n-1}}\]

Definition 30.5 The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. In general, the degrees of freedom of an estimate of a parameter are \((n - 1)\).

Why \((n-1)\) are the degrees of freedom

  • Degrees of freedom refer to the number of independent pieces of information that go into the computation. i.e. \(\{(x_{1}-\overline{x}), (x_{2}-\overline{x}), \ldots, (x_{n}-\overline{x})\}\)
  • However, \(\sum (x_{i}-\overline{x}) = 0\) for any data set.
  • Thus, only \((n − 1)\) of the \((x_{i}-\overline{x})\) values are independent.
    • if we know \((n − 1)\) of the values, the remaining value can be determined exactly by using the condition.
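
The constraint can be verified numerically for any small data set (a toy example; the values are arbitrary):

```r
# #Deviations from the mean always sum to zero
x <- c(12, 7, 9, 15, 2)
dev <- x - mean(x)
sum(dev)                   # 0
# #Hence knowing any (n - 1) deviations pins down the remaining one
-sum(dev[1:4]) == dev[5]   # TRUE
```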

Larger sample sizes are needed if the distribution of the population is highly skewed or includes outliers.

qt()

# #Like pnorm() is for P(z) and qnorm() is for z, pt() is for P(t) and qt() is for t.
# #Find t-value for confidence interval 95% i.e. (1-alpha) = 0.95 i.e. alpha = 0.05
# #To look for Area under the curve towards Right only i.e. alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
#
# #t-tables are unique for different degrees of freedom i.e. for DOF = 9 
dof_ii <- 9
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ", 
           format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 2.2622 (dof = 9)
#
dof_ii <- 60
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ", 
           format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 2.0003 (dof = 60)
#
dof_ii <- 600
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ", 
           format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.9639 (dof = 600)
#
# #t-tables have an Infinity row, which is the same as the z-table. For DOF > 100, it can be used.
dof_ii <- Inf
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ", 
           format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.96 (dof = Inf)

#
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ", 
           format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.975 (i.e. (Right) 1-P(z) = 0.025) at z = 1.96

Ex: Credit Card

# #A sample of n = 70 households provided the credit card balances.
xxCreditCards <- c(9430, 7535, 4078, 5604, 5179, 4416, 10676, 1627, 10112, 6567, 13627, 18719, 14661, 12195, 10544, 13659, 7061, 6245, 13021, 9719, 2200, 10746, 12744, 5742, 7159, 8137, 9467, 12595, 7917, 11346, 12806, 4972, 11356, 7117, 9465, 19263, 9071, 3603, 16804, 13479, 14044, 6817, 6845, 10493, 615, 13627, 12557, 6232, 9691, 11448, 8279, 5649, 11298, 4353, 3467, 6191, 12851, 5337, 8372, 7445, 11032, 6525, 5239, 6195, 12584, 15415, 15917, 12591, 9743, 10324)
f_setRDS(xxCreditCards)
bb <- f_getRDS(xxCreditCards)
mean_bb <- mean(bb)
sd_bb <- sd(bb)
dof_bb <- length(bb) - 1L
# #t-value for confidence interval 95% | (1-alpha) = 0.95 | alpha = 0.05 | alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
#
dof_ii <- dof_bb
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ", 
           format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.9949 (dof = 69)
#
# #Interval Estimate
err_margin_bb <- t_ii * sd_bb / sqrt(length(bb))
est_l <- mean_bb - err_margin_bb
est_r <- mean_bb + err_margin_bb
#
cat(paste0("Normal Sample (n=", length(bb), ", mean=", mean_bb, ", sd=", round(sd_bb, 1),
           "):\n Point Estimate = ", mean_bb, ", Margin of error = ", round(err_margin_bb, 1), 
           ", ", (1-2*p_r_ii) * 100, "% confidence interval is [", 
           round(est_l, 1), ", ", round(est_r, 1), "]"))
## Normal Sample (n=70, mean=9312, sd=4007):
##  Point Estimate = 9312, Margin of error = 955.4, 95% confidence interval is [8356.6, 10267.4]
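
The same interval can be cross-checked with base R's t.test(), which computes the t-based confidence interval directly (the data vector is repeated here so the chunk is self-contained):

```r
# #Cross-check of the credit card interval estimate with t.test()
xxCreditCards <- c(9430, 7535, 4078, 5604, 5179, 4416, 10676, 1627, 10112, 6567,
                   13627, 18719, 14661, 12195, 10544, 13659, 7061, 6245, 13021, 9719,
                   2200, 10746, 12744, 5742, 7159, 8137, 9467, 12595, 7917, 11346,
                   12806, 4972, 11356, 7117, 9465, 19263, 9071, 3603, 16804, 13479,
                   14044, 6817, 6845, 10493, 615, 13627, 12557, 6232, 9691, 11448,
                   8279, 5649, 11298, 4353, 3467, 6191, 12851, 5337, 8372, 7445,
                   11032, 6525, 5239, 6195, 12584, 15415, 15917, 12591, 9743, 10324)
tt <- t.test(xxCreditCards, conf.level = 0.95)
round(tt$conf.int, 1)      # matches the manually computed [8356.6, 10267.4]
```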

“ForLater”

  • Determining the Sample Size
  • Population Proportion

Validation


31 Hypothesis Tests

31.1 Overview

31.2 Hypothesis Testing

Definition 31.1 Hypothesis testing is a process in which, using data from a sample, an inference is made about a population parameter or a population probability distribution.

Note:

  • Hypothesis testing is used to determine whether a statement about the value of a population parameter should or should not be rejected.
  • It is the process of checking whether the sample information is consistent with a claim about the population.
  • The hypothesis testing procedure uses data from a sample to test the two competing statements indicated by \({H_0}\) and \({H_a}\)
Definition 31.2 Null Hypothesis \((H_0)\) is a tentative assumption about a population parameter. It is assumed True, by default, in the hypothesis testing procedure.
Definition 31.3 Alternative Hypothesis \((H_a)\) is the complement of the Null Hypothesis. It is concluded to be True, if the Null Hypothesis is rejected.

Note:

  • The conclusion that the alternative hypothesis \((H_a)\) is true is made if the sample data provide sufficient evidence to show that the null hypothesis \((H_0)\) can be rejected.
  • The null and alternative hypotheses are competing statements about the population. Either the null hypothesis \({H_0}\) is true or the alternative hypothesis \({H_a}\) is true, but not both.

31.3 Developing Null and Alternative Hypotheses

All hypothesis testing applications involve collecting a sample and using the sample results to provide evidence for drawing a conclusion.

In some situations it is easier to identify the alternative hypothesis first and then develop the null hypothesis.

  • The Alternative Hypothesis as a Research Hypothesis
    • A new fuel injection system designed to increase the miles-per-gallon rating from the current value 24 miles per gallon.
      • \(H_a : \mu > 24 \iff H_0: \mu \leq 24\)
    • A new teaching method is developed that is believed to be better than the current method.
      • \(H_a : \text{\{New method is better}\} \iff H_0: \text{\{New method is NOT better}\}\)
    • A new sales force bonus plan is developed in an attempt to increase sales.
      • \(H_a : \text{\{New plan increases sales}\} \iff H_0: \text{\{New plan does not increase sales}\}\)
    • A new drug is developed with the goal of lowering blood pressure more than an existing drug.
      • \({H_a}\) : New drug lowers blood pressure more than the existing drug
      • \({H_0}\) : New drug does not provide lower blood pressure than the existing drug
    • In each case, rejection of the null hypothesis \({H_0}\) provides statistical support for the research hypothesis \({H_a}\).
  • The Null Hypothesis as an Assumption to Be Challenged
    • The null hypothesis \({H_0}\) expresses the belief or assumption about the value of the population parameter. The alternative hypothesis \({H_a}\) is that the belief or assumption is incorrect.
    • Ex: The label on a soft drink bottle states that it contains 67.6 fluid ounces.
      • We consider the label correct provided the population mean filling weight for the bottles is at least 67.6 fluid ounces.
      • Without any reason to believe otherwise, we would give the manufacturer the benefit of the doubt and assume that the statement provided on the label is correct.
      • \(H_0 : \mu \geq 67.6 \iff H_a: \mu < 67.6\)
      • If the sample results lead to the conclusion to reject \({H_0}\), the inference that \(H_a: \mu < 67.6\) is true can be made. With this statistical support, the agency is justified in concluding that the label is incorrect and underfilling of the bottles is occurring. Appropriate action to force the manufacturer to comply with labeling standards would be considered.
      • However, if the sample results indicate \({H_0}\) cannot be rejected, the assumption that the labeling is correct cannot be rejected. With this conclusion, no action would be taken.
      • Product information is usually assumed to be true and stated as the null hypothesis. The conclusion that the information is incorrect can be made if the null hypothesis is rejected.
    • Same situation, from the point of view of the manufacturer
      • The company does not want to underfill the containers (legal requirement). However, the company does not want to overfill containers either because it would be an unnecessary cost.
      • \(H_0 : \mu = 67.6 \iff H_a: \mu \neq 67.6\)
      • If the sample results lead to the conclusion to reject \({H_0}\), the inference is made that \(H_a: \mu \neq 67.6\) is true. We conclude that the bottles are not being filled properly and the production process should be adjusted.
      • However, if the sample results indicate \({H_0}\) cannot be rejected, the assumption that the process is functioning properly cannot be rejected. In this case, no further action would be taken.

31.4 Three forms of hypotheses

For hypothesis tests involving a population mean, we let \({\mu}_0\) denote the hypothesized value and we must choose one of the following three forms for the hypothesis test.

The alternative is one-sided if it states that the parameter is larger or smaller than the null value. The alternative is two-sided if it states that the parameter is different from the null value.

Definition 31.4 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)
Definition 31.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
Definition 31.6 \(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)

Refer Equality in Hypothesis

Exercises

  • The manager of an automobile dealership is considering a new bonus plan designed to increase sales volume. Currently, the mean sales volume is 14 automobiles per month. The manager wants to conduct a research study to see whether the new bonus plan increases sales volume.
    • Solution: \(H_0 : \mu \leq 14 \iff H_a: \mu > 14\)
  • A director of manufacturing must convince management that a proposed manufacturing method reduces costs before the new method can be implemented. The current production method operates with a mean cost of 220 dollars per hour.
    • Solution: \(H_0 : \mu \geq 220 \iff H_a: \mu < 220\)

31.5 Type I and Type II Errors

Refer Type I and Type II Errors (B12)

Ideally the hypothesis testing procedure should lead to the acceptance of \({H_0}\) when \({H_0}\) is true and the rejection of \({H_0}\) when \({H_a}\) is true. Unfortunately, the correct conclusions are not always possible. Because hypothesis tests are based on sample information, we must allow for the possibility of errors.


Figure 31.1 (C09P01) Type-I \((\alpha)\) and Type-II \((\beta)\) Errors

Definition 31.7 The error of rejecting \({H_0}\) when it is true, is Type I error \(({\alpha})\).
Definition 31.8 The error of accepting \({H_0}\) when it is false, is Type II error \(({\beta})\).
Definition 31.9 The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.

30.3 The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.

In practice, the person responsible for the hypothesis test specifies the level of significance. By selecting \({\alpha}\), that person is controlling the probability of making a Type I error.

  • Most common value are \({\alpha} = 0.05, 0.01\).
    • For example, a significance level of \({\alpha} = 0.05\) indicates a 5% risk of concluding that a difference exists when there is no actual difference.
    • Lower significance levels indicate that you require stronger evidence before you will reject the null hypothesis.
    • If the cost of making a Type I error is high, small values of \({\alpha}\) are preferred. Ex: \({\alpha} = 0.01\)
    • If the cost of making a Type I error is not too high, larger values of \({\alpha}\) are typically used. Ex: \(\alpha = 0.05\)
Definition 31.10 Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.

Although most applications of hypothesis testing control for the probability of making a Type I error, they do not always control for the probability of making a Type II error. Hence, if we decide to accept \({H_0}\), we cannot determine how confident we can be with that decision. Because of the uncertainty associated with making a Type II error when conducting significance tests, statisticians usually recommend that we use the statement “do not reject \({H_0}\)” instead of “accept \({H_0}\).” Using the statement “do not reject \({H_0}\)” carries the recommendation to withhold both judgment and action. In effect, by not directly accepting \({H_0}\), the statistician avoids the risk of making a Type II error.

31.5.1 Additional

Refer figure 31.1

  1. Type I (\({\alpha}\)):
    • False Positive: Rejecting a True \({H_0}\) thus claiming False \({H_a}\)
    • An alpha error is when you mistakenly reject the Null and believe that something significant happened
      • i.e. you believe that the means of the two populations are different when they are not
      • i.e. you report that your findings are significant when in fact they have occurred by chance
    • The probability of making a type I error is represented by alpha level \({\alpha}\), which is the p-value below which you reject the null hypothesis
      • The p-value is the actual risk you have in being wrong if you reject the null
        • You would like that to be low
        • This p-value is compared with and should be lower than the alpha
        • A p-value of 0.05 indicates that you are willing to accept a 5% chance that you are wrong when you reject the null hypothesis. You can reduce your risk of committing a type I error by using a lower value for p. For example, a p-value of 0.01 would mean there is a 1% chance of committing a Type I error.
        • However, using a lower value for alpha means that you will be less likely to detect a true difference if one really exists (thus risking a type II error).
    • \({\alpha}\) is Significance Level (for \((1-{\alpha})\) confidence of not committing Type 1 error)
      • It is the boundary for specifying a statistically significant finding when interpreting the p-value
    • NOTE: Fail to reject True \({H_0}\) (\(\approx\) accept) is the correct decision shown in Top Left Quadrant
  2. Type II (\({\beta}\)):
    • False Negative: Failing to reject (\(\approx\) accept) a False \({H_0}\)
    • A beta error is when you fail to reject the null when you should have
      • i.e. you missed something significant and failed to take action
      • i.e. you conclude that there is not a significant effect, when actually there really is
      • You can decrease your risk of committing a type II error by ensuring your test has enough power.
      • You can do this by ensuring your sample size is large enough to detect a practical difference when one truly exists.

31.28 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).

  • The consequence of making a Type I error is that unnecessary changes or interventions are made, wasting time, resources, etc.
  • Type II errors typically lead to the preservation of the status quo (i.e. interventions remain the same) when change is needed.
  • Generally, a maximum of 5% for \({\alpha}\) and a maximum of 20% for \({\beta}\) is recommended
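
The trade-off between \({\alpha}\), \({\beta}\), and sample size can be explored with base R's power.t.test(). This sketch asks, under an illustrative assumption (a true mean difference of half a standard deviation), how large each group of a two-sample study must be to hold \({\alpha}\) at 5% and \({\beta}\) at 20% (i.e. power of 80%):

```r
# #Sample size per group for alpha = 0.05 and beta = 0.20 (power = 0.80),
# #assuming a true difference of 0.5 standard deviations (an illustrative choice)
pwr <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
                    type = "two.sample", alternative = "two.sided")
ceiling(pwr$n)             # n per group, about 64
```

Shrinking delta or tightening either error rate drives the required sample size up, which is the quantitative content of the 5%/20% rule of thumb above.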

31.6 Known SD

31.6.1 Test Statistic

Definition 31.11 Test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely the observed data match the distribution expected under the null hypothesis of that statistical test. It helps determine whether a null hypothesis should be rejected.

27.4 The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.

The test statistic summarizes the observed data into a single number using the central tendency, variation, sample size, and number of predictor variables in the statistical model. Refer Table 31.1

Table 31.1: (C09V01) Test Statistic
  • t-value
    • Null: The means of two groups are equal
    • Alternative: The means of two groups are not equal
    • Tests that use it: T-test, Regression tests
  • z-value
    • Null: The means of two groups are equal
    • Alternative: The means of two groups are not equal
    • Tests that use it: Z-test
  • F-value
    • Null: The variation among two or more groups is greater than or equal to the variation between the groups
    • Alternative: The variation among two or more groups is smaller than the variation between the groups
    • Tests that use it: ANOVA, ANCOVA, MANOVA
  • \({\chi}^2\text{-value}\)
    • Null: Two samples are independent
    • Alternative: Two samples are not independent (i.e. they are correlated)
    • Tests that use it: Chi-squared test, Non-parametric correlation tests

31.6.2 Tails

25.17 A tail refers to the tapering sides at either end of a distribution curve.

Definition 31.12 A one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic.

One-tailed tests are concerned with one side of a statistic: they deal with only one tail of the distribution, and the z-score lies on only one side of the statistic. Two-tailed tests, in contrast, deal with both tails of the distribution, and the z-score can lie on either side of the statistic.

In a one-tailed test, the area under the rejection region is equal to the level of significance, \({\alpha}\). When the rejection region is below the acceptance region, we say that it is a left-tail test. Similarly, when the rejection region is above the acceptance region, we say that it is a right-tail test.

In the two-tailed test, there are two critical regions, and the area under each region is \(\frac{\alpha}{2}\).

One-Tail vs. Two-Tail

  • One-tailed tests have more statistical power to detect an effect in one direction than a two-tailed test with the same design and significance level.
    • One-tailed tests occur most frequently for studies where one of the following is true:
      • Effects can exist in only one direction.
      • Effects can exist in both directions but the researchers only care about an effect in one direction.
  • The disadvantage of one-tailed tests is that they have no statistical power to detect an effect in the other direction.
    • A two-tailed hypothesis test, by contrast, is designed to show whether the sample mean is significantly greater than OR significantly less than the mean of a population.
      • A two-tailed test is designed to examine both sides of a specified data range as designated by the probability distribution involved.
  • Thumb rule
    • Consider both directions when deciding whether to run a one-tailed or a two-tailed test. If you can skip one tail and it is not irresponsible or unethical to do so, then you can run a one-tailed test.
    • A two-tailed test is used when you do not know the direction of the effect, so you test both sides.

31.6.3 One-tailed Test

One-tailed tests about a population mean take one of the following two forms:

31.4 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)

31.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

Definition 31.13 One-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its sampling distribution.

Example: The label on a can of Hilltop Coffee states that the can contains 3 pounds of coffee. As long as the population mean filling weight is at least 3 pounds per can, the rights of consumers will be protected. Thus, the government (FTC) interprets the label information on a large can of coffee as a claim by Hilltop that the population mean filling weight is at least 3 pounds per can.

  • Develop the null and alternative hypotheses for the test
    • \(H_0 : \mu \geq 3 \iff H_a: \mu < 3\)
  • Take a Sample
    • Suppose a sample of 36 cans of coffee is selected and the sample mean \({\overline{x}}\) is computed as an estimate of the population mean \({\mu}\). If the value of the sample mean \({\overline{x}}\) is less than 3 pounds, the sample results will cast doubt on the null hypothesis.
    • What we want to know is how much less than 3 pounds must \({\overline{x}}\) be before we would be willing to declare the difference significant and risk making a Type I error by falsely accusing Hilltop of a label violation. A key factor in addressing this issue is the value the decision maker selects for the level of significance.
  • Specify the level of significance \({\alpha}\)
    • FTC is willing to risk a 1% chance of making such an error i.e. \(\alpha = 0.01\)
  • Compute the value of test statistic
    • Assume known \({\sigma} = 0.18\) and a normal distribution
    • Refer equation (29.1), standard error of \({\overline{x}}\) is \({\sigma}_{\overline{x}} = \frac{{\sigma}}{\sqrt{n}} = \frac{0.18}{\sqrt{36}} = 0.03\)
    • Because the sampling distribution of \({\overline{x}}\) is normally distributed, \(z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - 3}{0.03}\)
    • Because the sampling distribution of \({\overline{x}}\) is normally distributed, the sampling distribution of \({z}\) is a standard normal distribution.
    • A value of \(z = −1\) means that the value of \({\overline{x}}\) is one standard error below the hypothesized value of the mean. For a value of \(z = −2\), it would be two standard errors below the mean, and so on.
    • We can use the standard normal probability table to find the lower tail probability \({P_{\left(z\right)}}\) corresponding to any \({z}\) value. Refer Get P(z) by pnorm() or z by qnorm()
      • Ex: \(P_{\left(z = -3\right)} = 0.0013\)
      • As a result, the probability of obtaining a value of \({\overline{x}}\) that is 3 or more standard errors below the hypothesized population mean \({\mu}_0 = 3\) is also 0.0013. i.e. Such a result is unlikely if the null hypothesis is true.
Definition 31.14 If \({\sigma}\) is known, the standard normal random variable \({z}\) is used as test statistic to determine whether \({\overline{x}}\) deviates from the hypothesized value of \({\mu}\) enough to justify rejecting the null hypothesis. Refer equation (31.1) \(\to z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}}\)

\[z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}} \tag{31.1}\]
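The standard error and the lower tail probability described above can be sketched in base R, using the Hilltop values (\({\sigma} = 0.18\), \(n = 36\)):

```r
# #Hilltop example: known sigma and sample size
sigma <- 0.18; n <- 36
#
# #Standard error of the sample mean
se <- sigma / sqrt(n)
se
## [1] 0.03
#
# #Lower tail probability P(z) for z = -3
pnorm(q = -3)
## [1] 0.001349898
```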

The key question for a lower tail test is: how small must the test statistic \({z}\) be before we choose to reject the null hypothesis?

Two approaches can be used to answer this: the p-value approach and the critical value approach.

31.6.3.1 p-value approach

Definition 31.15 The p-value approach uses the value of the test statistic \({z}\) to compute a probability called a p-value.
Definition 31.16 A p-value is a probability that provides a measure of the evidence against the null hypothesis provided by the sample. The p-value is used to determine whether the null hypothesis should be rejected. Smaller p-values indicate more evidence against \({H_0}\).

p-value (p) is the probability of obtaining a result equal to or more extreme than was observed in the data. It is the probability of observing the result given that the null hypothesis is true. A small p-value indicates the value of the test statistic is unusual given the assumption that \({H_0}\) is true.

For a lower tail test, the p-value is the probability of obtaining a value for the test statistic as small as or smaller than that provided by the sample.

  • We use the standard normal distribution to find the probability that \({z}\) is less than or equal to the value of the test statistic.
  • After computing the p-value, we must then decide whether it is small enough to reject the null hypothesis; this decision involves comparing the p-value to the level of significance.

For the Hilltop Coffee Example

  • Suppose the sample of 36 Hilltop coffee cans provides a sample mean of \({\overline{x}}\) = 2.92 pounds.
    • Is \(\overline{x} = 2.92\) small enough to cause us to reject \({H_0}\)?
  • Because this is a lower tail test, the p-value is the area under the standard normal curve for values of \({z}\) less than or equal to the value of the test statistic.
    • Refer equation (31.1), \(z = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}} = \frac{2.92 - 3}{0.18/\sqrt{36}} = -2.67\)
    • Thus, the p-value is the probability that \({z}\) is less than or equal to −2.67 (the lower tail area corresponding to the value of the test statistic).
  • Refer Get P(z) by pnorm() or z by qnorm(), to get the p-value
    • \(P_{\left(\overline{x} = 2.92\right)} = P_{\left(z = -2.67\right)} = 0.0038\)
    • This p-value does not provide much support for the null hypothesis, but is it small enough to cause us to reject \({H_0}\)?
  • Compare p-value with Level of significance \(\alpha = 0.01\)
    • Because .0038 is less than or equal to \(\alpha = 0.01\), we reject \({H_0}\). Therefore, we find sufficient statistical evidence to reject the null hypothesis at the .01 level of significance.
    • We can conclude that Hilltop is underfilling the cans.

Rejection Rule: Reject \({H_0}\) if p-value \(\leq {\alpha}\)

Further, in this case, we would reject \({H_0}\) for any value of \({\alpha} \geq (p = 0.0038)\). For this reason, the p-value is also called the observed level of significance.
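The p-value calculation above can be sketched in R (values from the Hilltop example):

```r
# #Hilltop example: test statistic for sample mean 2.92
z <- (2.92 - 3) / (0.18 / sqrt(36))
z
## [1] -2.666667
#
# #Lower tail p-value, compared with alpha = 0.01
p_value <- pnorm(q = z)
p_value <= 0.01
## [1] TRUE
```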

  • (Aside)
    • For the p-value approach, the likelihood (p-value) of the numerical value of the test statistic is compared to the specified significance level (\({\alpha}\)) of the hypothesis test.
    • The p-value corresponds to the probability of observing sample data at least as extreme as the actually obtained test statistic. Small p-values provide evidence against the null hypothesis. The smaller (closer to 0) the p-value, the stronger is the evidence against the null hypothesis.
    • “If the null hypothesis is true, what is the probability that we would observe a more extreme test statistic in the direction of the alternative hypothesis than we did”
    • Ex: (criminal trials) “If the defendant is innocent, what is the chance that we would observe such extreme criminal evidence”
    • pnorm() returns the cumulative probability up to q (i.e. \({\overline{x}}\)) for a normal distribution with a given mean \({\mu}\) and standard deviation \({\sigma}\).

31.6.3.2 Critical value approach

Definition 31.17 The critical value approach requires that we first determine a value for the test statistic called the critical value.
Definition 31.18 Critical value is the value that is compared with the test statistic to determine whether \({H_0}\) should be rejected. Significance level \({\alpha}\), or confidence level (\(1 - {\alpha}\)), dictates the critical value (\(Z\)), or critical limit. Ex: For Upper Tail Test, \(Z_{{\alpha} = 0.05} = 1.645\).

For a lower tail test, the critical value serves as a benchmark for determining whether the value of the test statistic is small enough to reject the null hypothesis.

  • The critical value is the value of the test statistic that corresponds to an area of \({\alpha}\) (the level of significance) in the lower tail of the sampling distribution of the test statistic.
  • In other words, the critical value is the largest value of the test statistic that will result in the rejection of the null hypothesis.

Hilltop Coffee Example

  • The sampling distribution for the test statistic \({z}\) is a standard normal distribution.
    • Therefore, the critical value is the value of the test statistic that corresponds to an area of \(\alpha = 0.01\) in the lower tail of a standard normal distribution.
    • Using the standard normal probability table, we find that \(P_{\left(z\right)} = 0.01\) for \(z_{\alpha = 0.01} = −2.33\)
    • Refer Get P(z) by pnorm() or z by qnorm()
    • Thus, if the sample results in a value of the test statistic that is less than or equal to −2.33, the corresponding p-value will be less than or equal to .01; in this case, we should reject the null hypothesis.
  • Compare test statistic with z-value
    • Because \((z = -2.67) < (z_{\alpha = 0.01} = −2.33)\), we can reject \({H_0}\)
    • We can conclude that Hilltop is underfilling the cans.

Rejection Rule: Reject \({H_0}\) if \(z \leq z_{\alpha}\)
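The critical value approach can be sketched in R (values from the Hilltop example):

```r
# #Critical value: z-value with lower tail area alpha = 0.01
z_crit <- qnorm(p = 0.01)
z_crit
## [1] -2.326348
#
# #Reject H0 because the test statistic falls at or below the critical value
z <- (2.92 - 3) / (0.18 / sqrt(36))
z <= z_crit
## [1] TRUE
```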

31.6.3.3 Summary

The p-value approach to hypothesis testing and the critical value approach will always lead to the same rejection decision; that is, whenever the p-value is less than or equal to \({\alpha}\), the value of the test statistic will be less than or equal to the critical value.

  • The advantage of the p-value approach is that the p-value tells us how significant the results are (the observed level of significance).
    • If we use the critical value approach, we only know that the results are significant at the stated level of significance.

For upper tail test: The test statistic \({z}\) is still computed as earlier. But, for an upper tail test, the p-value is the probability of obtaining a value for the test statistic as large as or larger than that provided by the sample. Thus, to compute the p-value for the upper tail test in the \({\sigma}\) known case, we must use the standard normal distribution to find the probability that \({z}\) is greater than or equal to the value of the test statistic. Using the critical value approach causes us to reject the null hypothesis if the value of the test statistic is greater than or equal to the critical value \(z_{\alpha}\); in other words, we reject \({H_0}\) if \(z \geq z_{\alpha}\).

31.6.3.4 Acceptance and Rejection Region

30.1 Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error (MOE), to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{MOE}_{\gamma}\)

Definition 31.19 An acceptance region (confidence interval) is a set of values of the test statistic for which the null hypothesis is accepted. i.e. if the observed test statistic is in the confidence interval then we accept the null hypothesis and reject the alternative hypothesis.

\[Z = \frac {{\overline{x}} - {\mu}}{{\sigma}/{\sqrt{n}}} \quad \iff {\mu} = {\overline{x}} - Z \frac{{\sigma}}{\sqrt{n}} \quad \to {\mu} = {\overline{x}} \pm Z \frac{{\sigma}}{\sqrt{n}} \quad \to {\mu} \approx {\overline{x}} \pm Z \frac{{s}}{\sqrt{n}} \tag{31.2}\]

Definition 31.20 The margin of error tells how far the true population mean might be from the sample mean. It is given by \(Z\frac{{\sigma}}{\sqrt{n}}\)
Definition 31.21 A rejection region (critical region), is a set of values for the test statistic for which the null hypothesis is rejected. i.e. if the observed test statistic is in the critical region then we reject the null hypothesis and accept the alternative hypothesis.

31.6.4 Two-tailed Test

31.6 \(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)

Definition 31.22 Two-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in either tail of its sampling distribution.

Ex: Golf Company, mean driving distance is 295 yards i.e. \(({\mu}_0 = 295)\)

  • \(H_0 : \mu = 295 \iff H_a: \mu \neq 295\)
  • The quality control team selected \(\alpha = 0.05\) as the level of significance for the test.
  • From previous tests, assume known \({\sigma} = 12\)
  • For a sample size \(n = 50\)
    • Standard Error of \({\overline{x}}\) is \({\sigma}_{\overline{x}} = \frac{{\sigma}}{\sqrt{n}} = \frac{12}{\sqrt{50}} = 1.7\)
    • Central Limit Theorem, allows us to conclude that the sampling distribution of \({\overline{x}}\) can be approximated by a normal distribution.
  • Suppose for the sample, \(\overline{x} = 297.6\)

“ForLater” - This part needs to be moved to the Next Chapter.

  • p-value approach
    • For a two-tailed test, the p-value is the probability of obtaining a value for the test statistic as unlikely as or more unlikely than that provided by the sample.
    • Refer equation (31.1), \(z = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}} = \frac{297.6 - 295}{12/\sqrt{50}} = 1.53\)
    • Now to compute the p-value we must find the probability of obtaining a value for the test statistic at least as unlikely as \(z = 1.53\).
      • Clearly values of \(z \geq 1.53\) are at least as unlikely.
      • But, because this is a two-tailed test, values of \(z \leq −1.53\) are also at least as unlikely as the value of the test statistic provided by the sample.
    • Refer Get P(z) by pnorm() or z by qnorm(), to get the p-value
      • \(P_{\left(z\right)} = P_{\left(z \leq -1.53\right)} + P_{\left(z \geq 1.53\right)}\)
      • \(P_{\left(z\right)} = 2 \times P_{\left(z \geq 1.53\right)}\), Because the normal curve is symmetric
      • \(P_{\left(z\right)} = 2 \times 0.0630 = 0.1260\)
    • Compare p-value with Level of significance \(\alpha = 0.05\)
      • We do not reject \({H_0}\) because the \((\text{p-value}= 0.1260) > (\alpha = 0.05)\)
      • Because the null hypothesis is not rejected, no action will be taken.
  • critical value approach
    • The critical values for the test will occur in both the lower and upper tails of the standard normal distribution.
    • With a level of significance of \(\alpha = 0.05\), the area in each tail corresponding to the critical values is \(\alpha/2 = 0.025\).
    • Refer Get P(z) by pnorm() or z by qnorm()
      • Using the standard normal probability table, we find that \(P_{\left(z\right)} = 0.025\) for \(-z_{\alpha/2 = 0.025} = −1.96\) and \(z_{\alpha/2 = 0.025} = 1.96\)
    • Compare test statistic with z-value
      • Because \((z = 1.53)\) is NOT greater than \((z_{\alpha/2 = 0.025} = 1.96)\), we cannot reject \({H_0}\)

Rejection Rule: Reject \({H_0}\) if \(z \leq -z_{\alpha/2}\) or \(z \geq z_{\alpha/2}\)
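Both approaches for this two-tailed test can be sketched in R, using the golf example values (\({\overline{x}} = 297.6\), \({\mu}_0 = 295\), \({\sigma} = 12\), \(n = 50\)):

```r
# #Two-tailed test: Golf example
z <- (297.6 - 295) / (12 / sqrt(50))
round(z, 2)
## [1] 1.53
#
# #p-value approach: double the upper tail area, compare with alpha = 0.05
p_value <- 2 * pnorm(q = z, lower.tail = FALSE)
p_value > 0.05
## [1] TRUE
#
# #Critical value approach: z-value for alpha/2 = 0.025 in the upper tail
qnorm(p = 0.025, lower.tail = FALSE)
## [1] 1.959964
```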

(Online, might be wrong) Ex: Assume that for a Population with mean \({\mu}\) unknown and standard deviation \({\sigma} = 15\), if we take a sample \({n = 100}\) its sample mean is \({\overline{x}} = 42\).

Assume \({\alpha} = 0.05\) and if we are conducting a Two Tail Test, \(Z_{\alpha/2 = 0.05/2} = 1.960\)

  • If we take a different sample of same size or a sample of different size, the sample mean calculated for those would be different.
  • So, our sample mean \({\overline{x}}\) might not be the true population mean \({\mu}\)
  • Thus, a range is inferred using the sample size, the sample mean, and the population standard deviation, and it is assumed that the true population mean falls within this interval. This interval is called a confidence interval.
  • A confidence interval is calculated using the critical limit \({z}\), and is thus specific to a significance level \({\alpha}\)
  • Margin of Error \(= Z\frac{{\sigma}}{\sqrt{n}} = 1.96 \times 15 /\sqrt{100} = 2.94\)

As shown in equation (31.2), our interval range is \(\mu = \overline{x} \pm 2.94 = 42 \pm 2.94 \rightarrow \mu \in (39.06, 44.94)\)

We are 95% confident that the population mean will be between 39.06 and 44.94

Note that a 95% confidence interval does not mean there is a 95% chance that the true value being estimated is in the calculated interval. Rather, given a population, there is a 95% chance that choosing a random sample from this population results in a confidence interval which contains the true value being estimated.
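A minimal R sketch of the interval computed above (\({\overline{x}} = 42\), \({\sigma} = 15\), \(n = 100\)):

```r
# #95% confidence interval: sigma = 15, n = 100, sample mean = 42
moe <- qnorm(p = 0.025, lower.tail = FALSE) * 15 / sqrt(100)
round(moe, 2)
## [1] 2.94
#
# #Interval estimate
round(42 + c(-moe, moe), 2)
## [1] 39.06 44.94
```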

31.7 Steps of Hypothesis Testing

Common Steps

  1. Develop the null and alternative hypotheses.
  2. Specify the level of significance.
  3. Collect the sample data and compute the value of the test statistic.

p-Value Approach Step

  1. Use the value of the test statistic to compute the p-value.
  2. Reject \({H_0}\) if the p-value \(\leq {\alpha}\).
  3. Interpret the statistical conclusion in the context of the application.
Definition 31.23 p-value Approach: Form Hypothesis | Specify \({\alpha}\) | Calculate test statistic | Calculate p-value | Compare p-value with \({\alpha}\) | Interpret

Critical Value Approach

  1. Use the level of significance to determine the critical value and the rejection rule.
  2. Use the value of the test statistic and the rejection rule to determine whether to reject \({H_0}\).
  3. Interpret the statistical conclusion in the context of the application.

31.8 Relationship Between Interval Estimation and Hypothesis Testing

Refer equation (30.1), For the \({\sigma}\) known case, the \(100{(1 - \alpha)}\%\) confidence interval estimate of a population mean is given by

\[\overline{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\]

We know that \(100 {(1 - \alpha)}\%\) of the confidence intervals generated will contain the population mean and \(100 {\alpha}\%\) of the confidence intervals generated will not contain the population mean.

Thus, if we reject \({H_0}\) whenever the confidence interval does not contain \({\mu}_0\), we will be rejecting the null hypothesis when it is true \((\mu = {\mu}_0)\) with probability \({\alpha}\).

The level of significance is the probability of rejecting the null hypothesis when it is true. So constructing a \(100 {(1 - \alpha)}\%\) confidence interval and rejecting \({H_0}\) whenever the interval does not contain \({\mu}_0\) is equivalent to conducting a two-tailed hypothesis test with \({\alpha}\) as the level of significance.

Ex: Golf company

  • For \({\alpha} = 0.05\), 95% confidence interval estimate of the population mean is
    • \({\overline{x}} \pm z_{0.025} \frac{{\sigma}}{\sqrt{n}} = 297.6 \pm 1.96 \frac{12}{\sqrt{50}} = 297.6 \pm 3.3\)
    • Interval: \([294.3, 300.9]\)
    • We can conclude with 95% confidence that the mean distance for the population of golf balls is between 294.3 and 300.9 yards.
    • Because the hypothesized value for the population mean, \({\mu}_0 = 295\), is in this interval, the hypothesis testing conclusion is that the null hypothesis, \({H_0: {\mu} = 295}\), cannot be rejected.
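This interval can be sketched in R:

```r
# #95% confidence interval: Golf example (sigma = 12, n = 50, sample mean = 297.6)
moe <- qnorm(p = 0.025, lower.tail = FALSE) * 12 / sqrt(50)
round(moe, 1)
## [1] 3.3
#
round(297.6 + c(-moe, moe), 1)
## [1] 294.3 300.9
```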

“ForLater” - Exercises

31.9 Unknown SD

Definition 31.24 If \({\sigma}\) is unknown, the sampling distribution of the test statistic follows the t distribution with \((n − 1)\) degrees of freedom. Refer equation (31.3) \(\to t = \frac{{\overline{x}} - {\mu}_0}{{s}/\sqrt{n}}\)

\[t = \frac{{\overline{x}} - {\mu}_0}{{s}/\sqrt{n}} \tag{31.3}\]

One-Tailed Test

  • Ex: Heathrow Airport, testing for mean rating 7 i.e. \({\mu}_0 = 7\)
    • \({H_0}: {\mu} \leq 7 \iff {H_a}: {\mu} > 7\)
    • Sample: \({\overline{x}} = 7.25, s = 1.052, n = 60\)
    • \({\alpha} = 0.05\)
    • Refer equation (31.3), \(t = \frac{\overline{x} - {\mu}_0}{s/\sqrt{n}} = \frac{7.25 - 7}{1.052/\sqrt{60}} = 1.84\)
    • \(\text{DOF} = n-1 = 60 -1 = 59\)
    • Refer For P(t), find t by qt() and This is a Right Tail Test
      • \({P_{\left(t \geq 1.84\right)}} = 0.0354\) i.e. between 0.05 and 0.025
    • Comparison
      • \({(P_{\left(t \geq 1.84\right)}} = 0.035) < ({\alpha} = 0.05)\)
      • Thus, we can reject the \({H_0}\) and can accept the \({H_a}\)

Critical Value Approach

  • \((\text{DOF} = 59), \, t_{{\alpha} = 0.05} = 1.671\)
  • Because \((t = 1.84) > (t_{{\alpha} = 0.05} = 1.671)\), Reject \({H_0}\)

# #Like pnorm() is for P(z) and qnorm() is for z, pt() is for P(t) and qt() is for t.
#
# #p-value approach: Find Commulative Probability P corresponding to the given t-value & DOF=59
pt(q = 1.84, df = 59, lower.tail = FALSE)
## [1] 0.03539999
#
# #Critical Value: t-value for which Area under the curve towards Right is alpha=0.05 & DOF=59
qt(p = 0.05, df = 59, lower.tail = FALSE)
## [1] 1.671093

Two Tailed Test

  • Ex: Holiday Toys, testing for sale of 40 units, i.e. \({\mu}_0 = 40\)
    • \({H_0}: {\mu} = 40 \iff {H_a}: {\mu} \neq 40\)
    • Sample: \({\overline{x}} = 37.4, s = 11.79, n = 25\)
    • \({\alpha} = 0.05\)
    • Refer equation (31.3), \(t = \frac{\overline{x} - {\mu}_0}{s/\sqrt{n}} = \frac{37.4 - 40}{11.79/\sqrt{25}} = -1.10\)
    • \(\text{DOF} = n-1 = 25 -1 = 24\)
    • Because we have a two-tailed test, the p-value is two times the area under the curve of the t distribution for \(t \leq -1.10\)
      • \(P_{\left(t\right)} = P_{\left(t \leq -1.10\right)} + P_{\left(t \geq 1.10\right)}\)
      • \(P_{\left(t\right)} = 2 \times P_{\left(t \leq -1.10\right)}\), Because the t distribution is symmetric
      • \(P_{\left(t\right)} = 2 \times 0.1411 = 0.2822\) i.e. between \(2 \times (0.10, 0.20) = (0.20, 0.40)\)
    • Comparison
      • \((P_{\left(t\right)} = 0.2822) > ({\alpha} = 0.05)\)
      • Thus, we cannot reject the \({H_0}\)

Critical Value Approach

  • \((\text{DOF} = 24)\): We find that \(P_{\left(t\right)} = 0.025\) for \(-t_{\alpha/2 = 0.025} = -2.064\) and \(t_{\alpha/2 = 0.025} = 2.064\)
  • Compare test statistic with t-value
    • Because \((t = -1.10)\) is NOT lower than \((-t_{\alpha/2 = 0.025} = -2.064)\), we cannot reject \({H_0}\)
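The calculations for this two-tailed t test can be sketched in R (values from the Holiday Toys example):

```r
# #Two-tailed t test: Holiday Toys example
t_stat <- (37.4 - 40) / (11.79 / sqrt(25))
round(t_stat, 2)
## [1] -1.1
#
# #p-value: double the lower tail area, compare with alpha = 0.05
p_value <- 2 * pt(q = t_stat, df = 24)
p_value > 0.05
## [1] TRUE
#
# #Critical value for alpha/2 = 0.025 and DOF = 24
qt(p = 0.025, df = 24, lower.tail = FALSE)
## [1] 2.063899
```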

31.10 Population Proportions

31.10.1 Hypothesis

Using \({p}_0\) to denote the hypothesized value for the population proportion, the three forms for a hypothesis test about a population proportion \({p}\) are :

Definition 31.25 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {p} \geq {p}_0 \iff {H_a}: {p} < {p}_0\)
Definition 31.26 \(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)
Definition 31.27 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {p} = {p}_0 \iff {H_a}: {p} \neq {p}_0\)

Hypothesis tests about a population proportion are based on the difference between the sample proportion \({\overline{p}}\) and the hypothesized population proportion \({p}_0\)

The sampling distribution of \({\overline{p}}\), the point estimator of the population parameter \({p}\), is the basis for developing the test statistic.

When the null hypothesis is true as an equality, the expected value of \({\overline{p}}\) equals the hypothesized value \({p}_0\) i.e. \(E_{(\overline{p})} = {p}_0\)

The standard error of \({\overline{p}}\) is given in equation (31.4)

\[{\sigma}_{\overline{p}} = \sqrt{\frac{{p}_0 (1 - {p}_0)}{n}} \tag{31.4}\]

If \(np \geq 5\) and \(n(1 − p) \geq 5\), the sampling distribution of \({\overline{p}}\) can be approximated by a normal distribution. Under these conditions, which usually apply in practice, the quantity \({z}\) as given in equation (31.5) has a standard normal probability distribution.

Test Statistic for Hypothesis Tests about a Population Proportion :

\[z = \frac{{\overline{p}} - {p}_0}{{\sigma}_{\overline{p}}} = \frac{{\overline{p}} - {p}_0}{\sqrt{\frac{{p}_0 (1 - {p}_0)}{n}}} \tag{31.5}\]

Example: Pine Creek: Determine whether the proportion of women golfers increased from \(p_0 = 0.20\)

31.26 \(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)

  • Count of Success \(({x})\) is Number of Women
  • \(\{n = 400, x = 100\} \to {\overline{p}} = {x}/{n} = 100/400 = 0.25\)
  • (31.5) \(z = \frac{{\overline{p}} - {p}_0}{{\sigma}_{\overline{p}}} = \frac{{\overline{p}} - {p}_0}{\sqrt{\frac{{p}_0 (1 - {p}_0)}{n}}}\)
    • \(z = \frac{0.25 - 0.20}{\sqrt{\frac{0.20 (1 - 0.20)}{400}}} = 2.50\)
      • {0.25 - 0.20}/{sqrt(0.20 * {1 - 0.20} / 400)} \(\#\mathcal{R}\)
  • \({}^U\!P_{(z = 2.50)} = 0.0062\)
    • pnorm(q = 2.50, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^U\!P_{(z)} < {\alpha} \to {H_0}\) is rejected i.e. the proportions are different
    • We can conclude that the proportion of women players has increased.
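The calculation can be sketched in R. As a cross-check (not from the text), base R’s prop.test() without continuity correction reports a chi-squared statistic equal to \(z^2\), so its p-value matches the hand calculation:

```r
# #z test for a proportion: Pine Creek example
z <- (0.25 - 0.20) / sqrt(0.20 * (1 - 0.20) / 400)
z
## [1] 2.5
#
# #Upper tail p-value
pnorm(q = z, lower.tail = FALSE)
## [1] 0.006209665
#
# #Cross-check: prop.test() without continuity correction; X-squared equals z^2
prop.test(x = 100, n = 400, p = 0.20, alternative = "greater", correct = FALSE)
```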

31.11 Hypothesis Testing and Decision Making

If the purpose of a hypothesis test is to make a decision when \({H_0}\) is true and a different decision when \({H_a}\) is true, the decision maker may want to, and in some cases be forced to, take action with both the conclusion do not reject \({H_0}\) and the conclusion reject \({H_0}\).

If this situation occurs, statisticians generally recommend controlling the probability of making a Type II error. With the probabilities of both the Type I and Type II error controlled, the conclusion from the hypothesis test is either to accept \({H_0}\) or reject \({H_0}\). In the first case, \({H_0}\) is concluded to be true, while in the second case, \({H_a}\) is concluded true. Thus, a decision and appropriate action can be taken when either conclusion is reached.

“ForLater” - Calculate \({\beta}\)

When the true population mean \({\mu}\) is close to the null hypothesis value of \({\mu} = 120\), the probability is high that we will make a Type II error. However, when the true population mean \({\mu}\) is far below the null hypothesis value of \({\mu} = 120\), the probability is low that we will make a Type II error.

Definition 31.28 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 − \beta\).
Definition 31.29 Power Curve is a graph of the probability of rejecting \({H_0}\) for all possible values of the population parameter \({\mu}\) not satisfying the null hypothesis. It provides the probability of correctly rejecting the null hypothesis.

Note that the power curve extends over the values of \({\mu}\) for which the null hypothesis is false. The height of the power curve at any value of \({\mu}\) indicates the probability of correctly rejecting \({H_0}\) when \({H_0}\) is false.
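As a sketch of how \({\beta}\) and power can be computed, assume hypothetical values (not from the text): a lower tail test with \({\mu}_0 = 120\), \({\sigma} = 12\), \(n = 36\), \({\alpha} = 0.05\), and a true mean of \({\mu} = 117\):

```r
# #Hypothetical lower tail test: H0: mu >= 120 vs Ha: mu < 120
mu0 <- 120; sigma <- 12; n <- 36; alpha <- 0.05
se <- sigma / sqrt(n)
#
# #Reject H0 when the sample mean falls at or below this critical value
x_crit <- mu0 + qnorm(p = alpha) * se
round(x_crit, 2)
## [1] 116.71
#
# #beta: probability of NOT rejecting H0 when the true mean is 117
beta <- pnorm(q = x_crit, mean = 117, sd = se, lower.tail = FALSE)
round(1 - beta, 2)   # #power = 1 - beta
## [1] 0.44
```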

31.12 Summary

We can make 3 observations about the relationship among \({\alpha}, \beta, n (\text{sample size})\).

  1. Once two of the three values are known, the other can be computed.
  2. For a given level of significance \({\alpha}\), increasing the sample size will reduce \({\beta}\).
  3. For a given sample size, decreasing \({\alpha}\) will increase \({\beta}\), whereas increasing \({\alpha}\) will decrease \({\beta}\).

Validation


32 Two Populations

32.1 Overview

32.2 Introduction

How interval estimates and hypothesis tests can be developed for situations involving two populations when the difference between the two population means or the two population proportions is of prime importance.

Example

  • To develop an interval estimate of the difference between the mean starting salary for a population of men and the mean starting salary for a population of women.
  • To conduct a hypothesis test to determine whether any difference is present between the proportion of defective parts in a population of parts produced by supplier A and the proportion of defective parts in a population of parts produced by supplier B.

32.3 Known SD: Two Population Means

Inferences About the Difference Between Two Population Means

Definition 32.1 Let \({\mathcal{N}}_{({\mu}_1, \, {\sigma}_1)}\) and \({\mathcal{N}}_{({\mu}_2, \, {\sigma}_2)}\) be the two populations. To make an inference about the difference between the means \(({\mu}_1 - {\mu}_2)\), we select a simple random sample of \({n}_1\) units from population 1 and a second simple random sample of \({n}_2\) units from population 2. The two samples, taken separately and independently, are referred to as independent simple random samples.

32.3.1 Interval Estimation

Interval Estimation of \(({\mu}_1 - {\mu}_2)\)

The point estimator of the difference between the two population means \(({\mu}_1 - {\mu}_2)\) is the difference between the two sample means \(({\overline{x}}_1 - {\overline{x}}_2)\). Thus, \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )}\) represents the difference of population means. It is given by equation (32.1)

Point Estimate :

\[E_{( {\overline{x}}_1 - {\overline{x}}_2 )} = {\overline{x}}_1 - {\overline{x}}_2 \tag{32.1}\]

As with other point estimators, the point estimator \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )}\) has a standard error \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)}\), that describes the variation in the sampling distribution of the estimator. It is the standard deviation of the sampling distribution of \(({\overline{x}}_1 - {\overline{x}}_2)\). Refer equation (32.2)

Standard Error of \(({\overline{x}}_1 - {\overline{x}}_2)\) :

\[{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}} \tag{32.2}\]

30.1 Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error (MOE), to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{MOE}_{\gamma}\)

In the case of estimation of the difference between two population means, an interval estimate will take the following form: \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )} \, \pm \text{MOE}_{{\gamma}}\). Refer equation (32.3) and (32.4)

Margin of Error (\(\text{MOE}_{{\gamma}}\)) :

\[\text{MOE}_{{\gamma}} = {z}_{\frac{{\alpha}}{2}}{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = {z}_{\frac{{\alpha}}{2}}\sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}} \tag{32.3}\]

\(\text{Interval Estimate}_{\gamma}\) :

\[\text{Interval Estimate}_{\gamma} = ({\overline{x}}_1 - {\overline{x}}_2) \pm {z}_{\frac{{\alpha}}{2}} \sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}} \tag{32.4}\]

Example: Greystone: Difference between the mean

32.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

  • (1: Inner ) \({n}_1 = 36, {\overline{x}}_1 = 40, {\sigma}_1 = 9\)
  • (2: Suburb) \({n}_2 = 49, {\overline{x}}_2 = 35, {\sigma}_2 = 10\)
  • For \({\gamma = 0.95} \iff{\alpha} = 0.05 \to {z_{{\alpha}/2}} = {z_{0.025}} = 1.96\)
    • qnorm(p = 0.025, lower.tail = FALSE) \(\#\mathcal{R}\)
  • (32.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}} = \sqrt{\frac{{9}^2}{36} + \frac{{10}^2}{49}} = 2.0714\)
    • sqrt(9^2/36 + 10^2/49) \(\#\mathcal{R}\)
  • (32.3) \(\text{MOE}_{\gamma} = {z}_{\frac{{\alpha}}{2}} {\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = 1.96 * 2.071 = 4.06\)
  • (32.4) \(\text{Interval Estimate}_{\gamma} = ({\overline{x}}_1 - {\overline{x}}_2) \pm \text{MOE}_{\gamma} = (40 - 35) \pm 4.06 = 5 \pm 4.06\)
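The interval estimate above can be sketched in R (values from the Greystone example):

```r
# #Interval estimate for the difference between two means: Greystone example
se_diff <- sqrt(9^2/36 + 10^2/49)
round(se_diff, 4)
## [1] 2.0714
#
moe <- qnorm(p = 0.025, lower.tail = FALSE) * se_diff
round(moe, 2)
## [1] 4.06
#
# #Interval around the point estimate (40 - 35)
round((40 - 35) + c(-moe, moe), 2)
## [1] 0.94 9.06
```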

32.3.2 Hypothesis Tests

Using \({D_0}\) to denote the hypothesized difference between \({\mu}_1\) and \({\mu}_2\), the three forms for a hypothesis test are as follows:

Definition 32.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)
Definition 32.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
Definition 32.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

The test statistic for the difference between two population means when \({\sigma}_1\) and \({\sigma}_2\) are known is given in equation (32.5)

Test Statistic for Hypothesis Tests :

\[z = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{\sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}}} \tag{32.5}\]

Example

Example: Evaluate differences in education quality between two training centers

32.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

  • (1: A) \({n}_1 = 30, {\overline{x}}_1 = 82, {\sigma}_1 = 10\)
  • (2: B) \({n}_2 = 40, {\overline{x}}_2 = 78, {\sigma}_2 = 10\)
  • (32.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{10}^2}{30} + \frac{{10}^2}{40}} = 2.4152\)
    • sqrt(10^2/30 + 10^2/40) \(\#\mathcal{R}\)
  • (32.5) \(z = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)}} = \frac{(82 - 78) - 0}{2.415} = 1.66\)
  • Calculate \({}^2\!P_{(z)}\)
    • Because z is in upper tail, we get upper tail area and because it is Two-Tail Test, we double it
    • \({}^2\!P_{(z = 1.66)} = 2 * {}^U\!P_{(z = 1.66)} = 2 * 0.0485 = 0.0970\)
      • 2 * pnorm(q = 1.66, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^2\!P_{(z)} > {\alpha} \to {H_0}\) cannot be rejected
    • The sample results do not provide sufficient evidence to conclude the training centers differ in quality.
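A minimal R sketch of the full two-tail test for this example, computing z itself rather than starting from the rounded 1.66 (variable names are ours):

```r
# #Training centers: two-tail z-test for (mu1 - mu2)
se <- sqrt(10^2/30 + 10^2/40)                    #(32.2)
z  <- ((82 - 78) - 0) / se                       #(32.5)
p2 <- 2 * pnorm(q = abs(z), lower.tail = FALSE)  # Two-Tail p-value
c(z = z, p = p2)
```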

Comparison

# #Get P(z) for z = 1.66 (Two-Tail)
#
# #Get the default (lower), subtract from 1, Double if Two-Tail
ii <- 2 * {1 - pnorm(q = 1.66)}
jj <- 2 * {1 - pnorm(q = 1.66, lower.tail = TRUE)}
#
# #Use the symmetry i.e. 'minus z' value, Double if Two-Tail
kk <- 2 * pnorm(q = -1.66)
#
# #Use the actual Upper Tail Option, Double if Two-Tail
ll <- 2 * pnorm(q = 1.66, lower.tail = FALSE)
#
stopifnot(all(identical(round(ii, 7), round(jj, 7)), identical(round(ii, 7), round(kk, 7)),
              identical(round(ii, 7), round(ll, 7))))
ll
## [1] 0.09691445

Shapiro-Wilk test

Definition 32.5 The Shapiro-Wilk test is a test of normality. It tests the null hypothesis that a sample came from a normally distributed population. \(P_{\text{shapiro}} > ({\alpha} = 0.05) \to \text{Data is Normal}\). Avoid using samples with more than 5000 observations.

The null-hypothesis of this test is that the population is normally distributed. Thus, if the p value is less than the chosen alpha level, then the null hypothesis is rejected and there is evidence that the data being tested is not distributed normally.

On the other hand, if the p value is greater than the chosen alpha level, then the null hypothesis (that the data came from a normally distributed population) cannot be rejected (e.g., for an alpha level of .05, a data set with a p value of less than .05 rejects the null hypothesis that the data are from a normally distributed population).

The Shapiro-Wilk test is known not to work well in samples with many identical values.

ERROR 32.1 Error in shapiro.test(...) : sample size must be between 3 and 5000
  • Ideally we should not test for normality of a sample with more than 5000 observations.
    • However, we can randomly select 5000 observations and test them by Shapiro.
    • We can use Anderson-Darling test or Kolmogorov-Smirnov test (Nonparametric) (KS is weaker than AD)
    • Anderson-Darling test is not quite as good as the Shapiro-Wilk test, but is better than other tests.
  • (SO) Normality Testing
    • Normality Tests like Shapiro or Anderson Darling are NULL Hypothesis Tests against the assumption of normality.
      • When the sample size is small, even big departures from normality are not detected
      • When your sample size is large, even the smallest deviation from normality will lead to a rejected null.
        • In the Shapiro-Wilk test, with more data the chance of the null hypothesis being rejected grows, even when for practical purposes the data are close enough to normal.
    • Do not worry much about normality. The CLT takes over quickly: with anything but the smallest sample sizes and an even remotely reasonable looking histogram, you are fine.
    • Worry instead about unequal variances (heteroskedasticity), which can be addressed with heteroscedasticity-consistent (HCCM) standard errors. A scale-location plot gives some idea of whether this assumption is violated, but not always.
    • Also, there is no a priori reason to assume equal variances in most cases.
    • Further, outliers: a Cook's distance > 1 is reasonable cause for concern.

32.4 Exercises

01

32.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

  • (1:) \({n}_1 = 50, {\overline{x}}_1 = 13.6, {\sigma}_1 = 2.2\)
  • (2:) \({n}_2 = 35, {\overline{x}}_2 = 11.6, {\sigma}_2 = 3.0\)
  • What is the point estimate of the difference between the two population means
    • (32.1) \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )} = 13.6 - 11.6 = 2\)
  • (32.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{2.2}^2}{50} + \frac{{3}^2}{35}} = 0.5949\)
    • sqrt(2.2^2/50 + 3^2/35) \(\#\mathcal{R}\)
  • Provide a 90% confidence interval for the difference between the two population means
    • \({\gamma = 0.90} \iff{\alpha} = 0.10 \to {z_{{\alpha}/2}} = {z_{0.05}} = 1.6448\)
      • qnorm(p = 0.05, lower.tail = FALSE) \(\#\mathcal{R}\)
    • (32.3) \(\text{MOE}_{\gamma =0.90} = 1.6448 * 0.5949 = 0.9785\)
    • (32.4) \(\text{Interval Estimate}_{\gamma} = 2 \pm 0.9785\)
  • Provide a 95% confidence interval for the difference between the two population means
    • \({\gamma = 0.95} \iff{\alpha} = 0.05 \to {z_{{\alpha}/2}} = {z_{0.025}} = 1.96\)
    • \(\text{MOE}_{\gamma =0.95} = 1.96 * 0.5949 = 1.166\)
    • \(\text{Interval Estimate}_{\gamma} = 2 \pm 1.166\)
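Both interval estimates can be computed in one R sketch (variable names are ours):

```r
# #Exercise 01: point estimate and 90% / 95% interval estimates
pe <- 13.6 - 11.6                                    #(32.1)
se <- sqrt(2.2^2/50 + 3^2/35)                        #(32.2)
moe90 <- qnorm(p = 0.05,  lower.tail = FALSE) * se   #(32.3)
moe95 <- qnorm(p = 0.025, lower.tail = FALSE) * se
rbind(ci90 = c(pe - moe90, pe + moe90),              #(32.4)
      ci95 = c(pe - moe95, pe + moe95))
```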

02

32.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)

  • (1:) \({n}_1 = 40, {\overline{x}}_1 = 25.2, {\sigma}_1 = 5.2\)
  • (2:) \({n}_2 = 50, {\overline{x}}_2 = 22.8, {\sigma}_2 = 6.0\)
  • (32.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{5.2}^2}{40} + \frac{{6}^2}{50}} = 1.1815\)
    • sqrt(5.2^2/40 + 6^2/50) \(\#\mathcal{R}\)
  • What is the value of the test statistic
    • (32.5) \(z = \frac{(25.2 - 22.8) - 0}{1.18} = 2.03\)
  • What is the p-value
    • \({}^U\!P_{z = 2.03} = 0.0212\)
      • pnorm(q = 2.03, lower.tail = FALSE) \(\#\mathcal{R}\)
  • With \({\alpha} = 0.05\), what is your hypothesis testing conclusion
    • \({}^U\!P_{z} < {\alpha} \to {H_0}\) Rejected
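A short R check of the test statistic and p-value (variable names are ours):

```r
# #Exercise 02: upper-tail z-test
se <- sqrt(5.2^2/40 + 6^2/50)               #(32.2)
z  <- ((25.2 - 22.8) - 0) / se              #(32.5)
p  <- pnorm(q = z, lower.tail = FALSE)      # Upper-tail p-value
c(z = z, p = p)
```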

04 Conde

32.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

  • (1: small) \({n}_1 = 37, {\overline{x}}_1 = 85.36, {\sigma}_1 = 4.55\)
  • (2: large) \({n}_2 = 44, {\overline{x}}_2 = 81.40, {\sigma}_2 = 3.97\)
  • (32.1) \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )} = 85.36 - 81.40 = 3.96\)
  • (32.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{4.55}^2}{37} + \frac{{3.97}^2}{44}} = 0.958\)
    • sqrt(4.55^2/37 + 3.97^2/44) \(\#\mathcal{R}\)
  • \({\gamma = 0.95} \iff{\alpha} = 0.05 \to {z_{{\alpha}/2}} = {z_{0.025}} = 1.96\)
    • qnorm(p = 0.025, lower.tail = FALSE) \(\#\mathcal{R}\)
  • (32.3), \(\text{MOE}_{\gamma =0.95} = 1.96 * 0.958 = 1.87768\)
  • (32.4), \(\text{Interval Estimate}_{\gamma} = 3.96 \pm 1.88\)
  • Test Statistic z
    • (32.5) \(z = \frac{(85.36 - 81.40 ) - 0}{0.958} = 4.13369\)
  • p-value
    • \({}^2\!P_{(z = 4.13369)} = 2 * {}^U\!P_{(z = 4.13369)} = 2 * 0.0000178 \approx 0\)
      • 2 * pnorm(q = 4.13369, lower.tail = FALSE) \(\#\mathcal{R}\)
    • \({}^2\!P_{z} < {\alpha} \to {H_0}\) Rejected
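The whole exercise can be verified in a few lines of R (variable names are ours):

```r
# #Exercise 04: two-tail z-test and 95% interval estimate
se  <- sqrt(4.55^2/37 + 3.97^2/44)               #(32.2)
z   <- ((85.36 - 81.40) - 0) / se                #(32.5)
p2  <- 2 * pnorm(q = z, lower.tail = FALSE)      # Two-Tail p-value
moe <- qnorm(p = 0.025, lower.tail = FALSE) * se #(32.3)
c(z = z, Lower = 3.96 - moe, Upper = 3.96 + moe)
```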

08 Rite Aid

32.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)

  • Will improving customer service result in higher stock prices
  • For each case: \({n} = 60, {\sigma} = 6, {\alpha} = 0.05\)
    • (32.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{6}^2}{60} + \frac{{6}^2}{60}} = 1.0954\)
      • sqrt(6^2/60 + 6^2/60) \(\#\mathcal{R}\)
  • Rite \(\{{\overline{x}}_1 = 73, {\overline{x}}_2 = 76\}\), Expedia \(\{{\overline{x}}_1 = 75, {\overline{x}}_2 = 77\}\), JC \(\{{\overline{x}}_1 = 77, {\overline{x}}_2 = 78\}\)
  • For Rite Aid, is the increase in the satisfaction score from year 1 to year 2 statistically significant
    • (32.5) \(z = \frac{(73 - 76) - 0}{1.0954} = -2.738613\)
    • \({}^L\!P_{(z = -2.738613)} = 0.003\)
      • pnorm(q = -2.738613, lower.tail = TRUE) \(\#\mathcal{R}\)
      • \({}^L\!P_{z} < {\alpha} \to {H_0}\) is rejected i.e. the increase is significant
      • Caution: A Two-Tail Test here would yield a different conclusion; the form of the test should be chosen carefully.

32.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)

  • Can you conclude that the year 2 score for Rite Aid is above the national average of 75.7
    • Group 1 is the year 2 score (76); group 2 is the national average (75.7)
    • (32.5) \(z = \frac{(76 - 75.7) - 0}{1.0954} = 0.2739\)
    • \({}^U\!P_{(z = 0.2739)} = 0.392\)
      • pnorm(q = 0.2739, lower.tail = FALSE) \(\#\mathcal{R}\)
      • \({}^U\!P_{z} > {\alpha} \to {H_0}\) cannot be rejected. Difference is NOT significant.
      • Cannot conclude that the year 2 score is above the national average

32.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)

  • For Expedia, is the increase from year 1 to year 2 statistically significant
    • (32.5) \(z = \frac{(75 - 77) - 0}{1.0954} = -1.826\)
    • \({}^L\!P_{(z = -1.826)} = 0.0339\)
      • pnorm(q = -1.826, lower.tail = TRUE) \(\#\mathcal{R}\)
      • \({}^L\!P_{z} < {\alpha} \to {H_0}\) is rejected i.e. the increase is significant
  • When conducting a hypothesis test with the values given for the standard deviation, sample size, and alpha, how large must the increase from year 1 to year 2 be for it to be statistically significant
    • \({\alpha} = 0.05 \iff {}^L\!P_{(z)} = 0.05 \to {z_{0.05}} = -1.6448\)
      • qnorm(p = 0.05, lower.tail = TRUE) \(\#\mathcal{R}\)
    • (32.3), \(\text{MOE}_{\gamma =0.95} = -1.6448 * 1.0954 = -1.8\)
    • An increase of at least 1.8 is needed for the result to be significant
    • For JC, the increase is only 1, i.e. less than 1.8, so it will not be significant
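All three retailers can be tested in one pass; the `scores` list and variable names below are ours, with values taken from the bullets above:

```r
# #Lower-tail z-tests for all three retailers (year 1 vs year 2)
se <- sqrt(6^2/60 + 6^2/60)                 #(32.2)
scores <- list(Rite = c(73, 76), Expedia = c(75, 77), JC = c(77, 78))
res <- sapply(scores, function(x) {
  z <- ((x[1] - x[2]) - 0) / se             #(32.5)
  c(z = z, p = pnorm(q = z, lower.tail = TRUE))
})
round(res, 4)
```

Only JC has a p-value above 0.05, matching the margin-of-error argument above.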

32.5 Unknown SD: Two Population Means

Use the sample standard deviations, \({s}_1\) and \({s}_2\), to estimate the unknown population standard deviations \(({\sigma}_1, {\sigma}_2)\).

When \(({\sigma}_1, {\sigma}_2)\) are estimated by \(({s}_1, {s}_2)\), the t distribution is used to make inferences about the difference between two population means.

Interval Estimation of \(({\mu}_1 - {\mu}_2)\)

Margin of Error (\(\text{MOE}_{{\gamma}}\)) : Refer (32.6) like (32.3)

\[\text{MOE}_{{\gamma}} = {t}_{\frac{{\alpha}}{2}}{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = {t}_{\frac{{\alpha}}{2}}\sqrt{\frac{{s}_1^2}{{n}_1} + \frac{{s}_2^2}{{n}_2}} \tag{32.6}\]

\(\text{Interval Estimate}_{\gamma}\) : Refer (32.7) like (32.4) using (32.2)

\[\text{Interval Estimate}_{\gamma} = ({\overline{x}}_1 - {\overline{x}}_2) \pm {t}_{\frac{{\alpha}}{2}}{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = ({\overline{x}}_1 - {\overline{x}}_2) \pm {t}_{\frac{{\alpha}}{2}}\sqrt{\frac{{s}_1^2}{{n}_1} + \frac{{s}_2^2}{{n}_2}} \tag{32.7}\]

Degrees of Freedom (DOF) : Refer (32.8)

\[\text{DOF} = \frac{ { \left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right) }^2} {\frac{1}{n_1 - 1}{ \left( \frac{s_1^2}{n_1} \right) }^2 + \frac{1}{n_2 - 1}{ \left( \frac{s_2^2}{n_2} \right) }^2} \tag{32.8}\]
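Equation (32.8) can be wrapped in a small helper; the function name `welch_df` and the illustrative inputs below are ours:

```r
# #Welch-Satterthwaite degrees of freedom, equation (32.8)
welch_df <- function(s1, n1, s2, n2) {
  v1 <- s1^2 / n1
  v2 <- s2^2 / n2
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}
# #Illustrative call; truncate to an integer DOF as in the examples
floor(welch_df(s1 = 2.5, n1 = 20, s2 = 4.8, n2 = 30))
```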

Example: Clearwater - To estimate the difference between the mean

32.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

  • (1: Cherry) \({n}_1 = 28, {\overline{x}}_1 = 1025, {s}_1 = 150\)
  • (2: Beech) \({n}_2 = 22, {\overline{x}}_2 = 910, {s}_2 = 125\)
  • (32.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{150}^2}{28} + \frac{{125}^2}{22}} = 38.9076\)
    • sqrt(150^2/28 + 125^2/22) \(\#\mathcal{R}\)
  • (32.8) \(\text{DOF} = 47\)
    • floor({150^2 / 28 + 125^2 / 22 }^2 / {{150^2 / 28}^2/{28-1} + {125^2 / 22}^2/{22-1}}) \(\#\mathcal{R}\)
  • \({\gamma = 0.95} \iff{\alpha} = 0.05 \to {{}^2\!t_{{\alpha}/2}} = {{}^2\!t_{0.025}} = 2.012\)
    • qt(p = 0.025, df = 47, lower.tail = FALSE) \(\#\mathcal{R}\)
  • (32.6) \(\text{MOE}_{\gamma =0.95} = 2.012 * 38.9076 = 78.3\)
  • (32.7) \(\text{Interval Estimate}_{\gamma} = (1025-910) \pm 78\)
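The Clearwater interval estimate, end to end in R (variable names are ours):

```r
# #Clearwater: 95% interval estimate with unknown sigmas
se  <- sqrt(150^2/28 + 125^2/22)                                 #(32.2)
dof <- floor((150^2/28 + 125^2/22)^2 /
             ((150^2/28)^2/(28 - 1) + (125^2/22)^2/(22 - 1)))    #(32.8)
moe <- qt(p = 0.025, df = dof, lower.tail = FALSE) * se          #(32.6)
c(Lower = (1025 - 910) - moe, Upper = (1025 - 910) + moe)        #(32.7)
```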

Hypothesis Tests

Test Statistic for Hypothesis Tests : Refer (32.9) like (32.5) using (32.2)

\[t = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)}} = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{\sqrt{\frac{{s}_1^2}{{n}_1} + \frac{{s}_2^2}{{n}_2}}} \tag{32.9}\]

Caution: Use of the \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)}\) symbol is probably wrong here, because that symbol should not denote a formula containing \({s}_1, {s}_2\). “ForLater”

ERROR 32.2 Error in t.test.formula() : grouping factor must have exactly 2 levels
  • For t.test(): Formula needs to be ‘value ~ key,’ not ‘key ~ value’

Example

Software: To show that the new software will provide a shorter mean time

32.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)

  • (1: Old) \({n}_1 = 12, {\overline{x}}_1 = 325, {s}_1 = 40\)
  • (2: New) \({n}_2 = 12, {\overline{x}}_2 = 286, {s}_2 = 44\)
  • (32.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{40}^2}{12} + \frac{{44}^2}{12}} = 17.1659\)
    • sqrt(40^2/12 + 44^2/12) \(\#\mathcal{R}\)
  • (32.8) \(\text{DOF} = 21\)
    • floor({40^2 / 12 + 44^2 / 12 }^2 / {{40^2 / 12}^2/{12-1} + {44^2 / 12}^2/{12-1}}) \(\#\mathcal{R}\)
  • (32.9) \(t = \frac{(325 - 286) - 0}{17.1659} = 2.272\)
  • Calculate \({}^U\!P_{(t)}\)
    • \({}^U\!P_{(t = 2.272)} = 0.0168\)
      • pt(q = 2.272, df = 21, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^U\!P_{(t)} < {\alpha} \to {H_0}\) is rejected i.e. the decrease is significant
    • It supports the conclusion that the new software provides a smaller population mean.

Code

CONVERT HERE to Tilde Based Option, to take advantage of Column Headers

# #Software 
xxSoftware <- tibble(Old = c(300, 280, 344, 385, 372, 360, 288, 321, 376, 290, 301, 283), 
                     New = c(274, 220, 308, 336, 198, 300, 315, 258, 318, 310, 332, 263))
aa <- xxSoftware
# Summary
aa %>% 
  pivot_longer(everything(), names_to = "key", values_to = "value") %>% 
  group_by(key) %>% 
  summarise(across(value, list(Count = length, Mean = mean, SD = sd), .names = "{.fn}"))
## # A tibble: 2 x 4
##   key   Count  Mean    SD
##   <chr> <int> <dbl> <dbl>
## 1 New      12   286  44.0
## 2 Old      12   325  40.0
bb <- aa %>% pivot_longer(everything(), names_to = "key", values_to = "value")
#
# #Welch Two Sample t-test
# #Alternative must be: "two.sided" (Default), "less", "greater"
# #NOTE: with value ~ key, groups are ordered alphabetically (New, Old), so "greater" tests (New - Old) > 0
bb_ha <- "greater"
#bb_testT <- t.test(x = bb$Old, y = bb$New, alternative = bb_ha)
bb_testT <- t.test(formula = value ~ key, data = bb, alternative = bb_ha)
bb_testT
## 
##  Welch Two Sample t-test
## 
## data:  value by key
## t = -2.2721, df = 21.803, p-value = 0.9833
## alternative hypothesis: true difference in means between group New and group Old is greater than 0
## 95 percent confidence interval:
##  -68.48569       Inf
## sample estimates:
## mean in group New mean in group Old 
##               286               325
#
cat(paste0("t is the t-test statistic value (t = ", round(bb_testT$statistic, 6), ")\n"))
## t is the t-test statistic value (t = -2.272127)
cat(paste0("df is the degrees of freedom (df = ", round(bb_testT$parameter, 1), ")\n"))
## df is the degrees of freedom (df = 21.8)
cat(paste0("p-value is the significance level of the t-test (p-value = ", 
           round(bb_testT$p.value, 6), ")\n"))
## p-value is the significance level of the t-test (p-value = 0.98335)
cat(paste0("conf.int is the confidence interval of the mean at 95% (conf.int = [",
           paste0(round(bb_testT$conf.int, 3), collapse = ", "), "])\n"))
## conf.int is the confidence interval of the mean at 95% (conf.int = [-68.486, Inf])
cat(paste0("sample estimates is the mean value of the samples. Mean: ", 
           paste0(round(bb_testT$estimate, 2), collapse = ", "), "\n"))
## sample estimates is the mean value of the samples. Mean: 286, 325
#
# #Compare p-value with alpha = 0.05
alpha <- 0.05
if(bb_testT$p.value >= alpha) {
  cat(paste0("p-value (", round(bb_testT$p.value, 6), ") is greater than alpha (", alpha, 
      "). We failed to reject H0. We cannot conclude that the populations are different.\n")) 
} else {
    cat(paste0("p-value (", round(bb_testT$p.value, 6), ") is less than alpha (", alpha, 
      ").\nWe can reject the H0 with 95% confidence. The populations are different.\n"))
}
## p-value (0.98335) is greater than alpha (0.05). We failed to reject H0. We cannot conclude that the populations are different.

Pooled

Another approach used to make inferences about the difference between two population means when \({\sigma}_1\) and \({\sigma}_2\) are unknown is based on the assumption that the two population standard deviations are equal \(({\sigma}_1 = {\sigma}_2 = {\sigma})\). Under this assumption, the two sample standard deviations are combined to provide the pooled sample variance as given in equation (32.10).

\[{s}_p^2 = \frac{({n}_1 - 1){s}_1^2 + ({n}_2 - 1){s}_2^2}{{n}_1 + {n}_2 - 2} \tag{32.10}\]

The t test statistic becomes (32.11) with \(({n}_1 + {n}_2 - 2)\) degrees of freedom.

\[t = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{{s}_p\sqrt{\frac{1}{{n}_1} + \frac{1}{{n}_2}}} \tag{32.11}\]

Then the computation of the p-value and the interpretation of the sample results are the same as earlier.

  • Caution: A difficulty with this procedure is that the assumption that the two population standard deviations are equal is usually difficult to verify.
    • Unequal population standard deviations are frequently encountered.
    • Using the pooled procedure may not provide satisfactory results, especially if the sample sizes \({n}_1\) and \({n}_2\) are quite different.
    • The original t procedure does not require this assumption. It is a more general procedure and is recommended for most applications.
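A direct computation of (32.10) and (32.11), reusing the Software example values from above (n1 = n2 = 12, s1 = 40, s2 = 44, x̄1 = 325, x̄2 = 286); variable names are ours:

```r
# #Pooled sample variance and t statistic, equations (32.10)-(32.11)
sp2 <- ((12 - 1) * 40^2 + (12 - 1) * 44^2) / (12 + 12 - 2)   #(32.10)
t   <- ((325 - 286) - 0) / (sqrt(sp2) * sqrt(1/12 + 1/12))   #(32.11), df = 12 + 12 - 2 = 22
c(sp2 = sp2, t = t)
```

Because the two sample sizes are equal here, the pooled t statistic coincides with the Welch t statistic computed earlier.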

Pooled Code

(External) Unpaired Two-Samples T-test

# #Pooled Sample Variance
# #Evaluate differences in education quality between two training centers
# #Generate Data for Two Centers
set.seed(3)
setA <- rnorm(n = 30, mean = 82, sd = 10)
setB <- rnorm(n = 40, mean = 78, sd = 10)
bb <- tibble(sets = c(rep("setA", length(setA)), rep("setB", length(setB))), values = c(setA, setB))
# Summary
bb %>% group_by(sets) %>% summarise(Count = n(), Mean = mean(values), SD = sd(values))
## # A tibble: 2 x 4
##   sets  Count  Mean    SD
##   <chr> <int> <dbl> <dbl>
## 1 setA     30  79.7  8.10
## 2 setB     40  78.8  9.43
#
# #Assumption 1: Are the two samples independents
# #YES
#
# #Assumption 2: Do the data from each of the 2 groups follow a normal distribution
# #Shapiro-Wilk normality test
isNormal_A <- with(bb, shapiro.test(values[sets == "setA"]))
isNormal_B <- with(bb, shapiro.test(values[sets == "setB"]))
#isNormal_A
#isNormal_A$statistic
# #p-value > 0.05 is needed for Normality
isNormal_A$p.value
## [1] 0.1032747
isNormal_B$p.value
## [1] 0.15595
#
# #Both have p-values greater than the significance level alpha = 0.05.
# #implying that the distributions of the data are not significantly different from normal
# #In other words, we can assume normality.
#
bb %>% 
  group_by(sets) %>% 
  summarise(p = shapiro.test(values)$p.value)
## # A tibble: 2 x 2
##   sets      p
##   <chr> <dbl>
## 1 setA  0.103
## 2 setB  0.156
#
# #Assumption 3. Do the two populations have the same variances
# #We will use F-test to test for homogeneity in variances. (p-value > 0.05 is needed)
#
bb_testF <- var.test(values ~ sets, data = bb)
# bb_testF
bb_testF$p.value
## [1] 0.3990791
#
# #The p-value of F-test is greater than the significance level alpha = 0.05. 
# #In conclusion, there is no significant difference between the variances of the two sets of data.
# #Therefore, we can use the classic t-test which assumes equality of the two variances.
#
# #Compute unpaired two-samples t-test
#
# #Question : Is there any significant difference between the mean of two populations
#
bb_testT <- t.test(formula = values ~ sets, data = bb, var.equal = TRUE)
bb_testT
## 
##  Two Sample t-test
## 
## data:  values by sets
## t = 0.41947, df = 68, p-value = 0.6762
## alternative hypothesis: true difference in means between group setA and group setB is not equal to 0
## 95 percent confidence interval:
##  -3.383982  5.185373
## sample estimates:
## mean in group setA mean in group setB 
##           79.66783           78.76714
#
cat(paste0("t is the t-test statistic value (t = ", round(bb_testT$statistic, 6), ")\n"))
## t is the t-test statistic value (t = 0.419474)
cat(paste0("df (n1 + n2 - 2) is the degrees of freedom (df = ", bb_testT$parameter, ")\n"))
## df (n1 + n2 - 2) is the degrees of freedom (df = 68)
cat(paste0("p-value is the significance level of the t-test (p-value = ", 
           round(bb_testT$p.value, 6), ")\n"))
## p-value is the significance level of the t-test (p-value = 0.676192)
cat(paste0("conf.int is the confidence interval of the mean at 95% (conf.int = [",
           paste0(round(bb_testT$conf.int, 3), collapse = ", "), "])\n"))
## conf.int is the confidence interval of the mean at 95% (conf.int = [-3.384, 5.185])
cat(paste0("sample estimates is the mean value of the samples. Mean: ", 
           paste0(round(bb_testT$estimate, 2), collapse = ", "), "\n"))
## sample estimates is the mean value of the samples. Mean: 79.67, 78.77
#
# #Compare p-value with alpha = 0.05
alpha <- 0.05
if(bb_testT$p.value >= alpha) {
  cat(paste0("p-value (", round(bb_testT$p.value, 6), ") is greater than alpha (", alpha, 
      "). We failed to reject H0. We cannot conclude that the populations are different.\n")) 
} else {
    cat(paste0("p-value (", round(bb_testT$p.value, 6), ") is less than alpha (", alpha, 
      ").\nWe can reject the H0 with 95% confidence. The populations are different.\n"))
}
## p-value (0.676192) is greater than alpha (0.05). We failed to reject H0. We cannot conclude that the populations are different.

Columns Summary

bb <- xxSoftware
str(bb)
## tibble [12 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Old: num [1:12] 300 280 344 385 372 360 288 321 376 290 ...
##  $ New: num [1:12] 274 220 308 336 198 300 315 258 318 310 ...
#
# #Applying Multiple Functions with Summarise but Output as Cross-Table
# #(Original) Columns as Rows, Functions as Columns
# #NOTE: n() can be applied as lambda function 
bb %>% 
  pivot_longer(everything(), names_to = "key", values_to = "value") %>% 
  group_by(key) %>% 
  summarise(across(value, list(N = ~n(), Count = length, Mean = mean, SD = sd), .names = "{.fn}"))
## # A tibble: 2 x 5
##   key       N Count  Mean    SD
##   <chr> <int> <int> <dbl> <dbl>
## 1 New      12    12   286  44.0
## 2 Old      12    12   325  40.0

pivot_longer()

str(bb)
## tibble [12 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Old: num [1:12] 300 280 344 385 372 360 288 321 376 290 ...
##  $ New: num [1:12] 274 220 308 336 198 300 315 258 318 310 ...
#
# #gather() is deprecated. Here is for reference.
# #Longer Tibble is filled with All Values of Col A, then All Values of Col B and so on
ii <- gather(bb)
jj <- bb %>% gather("key", "value")
kk <- bb %>% gather("key", "value", everything()) 
#
# #pivot_longer()
# #Longer Tibble is filled with First Row of All Columns, then 2nd Row of All Columns and so on
ll <- bb %>% pivot_longer(everything(), names_to = "key", values_to = "value") %>% arrange(desc(key))
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll)))

Multiple Functions

str(bb)
## tibble [12 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Old: num [1:12] 300 280 344 385 372 360 288 321 376 290 ...
##  $ New: num [1:12] 274 220 308 336 198 300 315 258 318 310 ...
# #Store a Grouped Tibble
ii <- bb %>% 
  pivot_longer(everything(), names_to = "key", values_to = "value") %>% 
  group_by(key) 
str(ii)
## grouped_df [24 x 2] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ key  : chr [1:24] "Old" "New" "Old" "New" ...
##  $ value: num [1:24] 300 274 280 220 344 308 385 336 372 198 ...
##  - attr(*, "groups")= tibble [2 x 2] (S3: tbl_df/tbl/data.frame)
##   ..$ key  : chr [1:2] "New" "Old"
##   ..$ .rows: list<int> [1:2] 
##   .. ..$ : int [1:12] 2 4 6 8 10 12 14 16 18 20 ...
##   .. ..$ : int [1:12] 1 3 5 7 9 11 13 15 17 19 ...
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE
ii %>% summarise(across(value, list(N = ~n(), Count = length, Mean = mean, SD = sd), 
                        .names = "{.fn}"))
## # A tibble: 2 x 5
##   key       N Count  Mean    SD
##   <chr> <int> <int> <dbl> <dbl>
## 1 New      12    12   286  44.0
## 2 Old      12    12   325  40.0
# 
# #Equivalent (except Column Headers)
ii %>% summarise(N = n(), Count = across(value, length), 
                 Mean = across(value, mean), SD = across(value, sd))
## # A tibble: 2 x 5
##   key       N Count$value Mean$value SD$value
##   <chr> <int>       <int>      <dbl>    <dbl>
## 1 New      12          12        286     44.0
## 2 Old      12          12        325     40.0

32.6 Exercises

“ForLater”

09

  • (1:) \({n}_1 = 20, {\overline{x}}_1 = 22.5, {\sigma}_1 = 2.5\)
  • (2:) \({n}_2 = 30, {\overline{x}}_2 = 20.1, {\sigma}_2 = 4.8\)
  • (32.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{2.5}^2}{20} + \frac{{4.8}^2}{30}} = 1.0395\)
    • sqrt(2.5^2/20 + 4.8^2/30) \(\#\mathcal{R}\)
  • point estimate of the difference between the two population means
    • (32.1) \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )} = 22.5 - 20.1 = 2.4\)
  • (32.8) \(\text{DOF} = 45\)
    • floor({2.5^2 / 20 + 4.8^2 / 30 }^2 / {{2.5^2 / 20}^2/{20-1} + {4.8^2 / 30}^2/{30-1}}) \(\#\mathcal{R}\)
  • \({\gamma = 0.95} \iff{\alpha} = 0.05 \to {{}^2\!t_{{\alpha}/2}} = {{}^2\!t_{0.025}} = 2.014\)
    • qt(p = 0.025, df = 45, lower.tail = FALSE) \(\#\mathcal{R}\)
  • (32.6) \(\text{MOE}_{\gamma =0.95} = 2.014 * 1.0395 = 2.094 \approx 2.1\)
  • (32.7) \(\text{Interval Estimate}_{\gamma} = 2.4 \pm 2.1\)
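Exercise 09 end to end in R (variable names are ours):

```r
# #Exercise 09: 95% interval estimate with unknown sigmas
se  <- sqrt(2.5^2/20 + 4.8^2/30)                               #(32.2)
dof <- floor((2.5^2/20 + 4.8^2/30)^2 /
             ((2.5^2/20)^2/(20 - 1) + (4.8^2/30)^2/(30 - 1)))  #(32.8)
moe <- qt(p = 0.025, df = dof, lower.tail = FALSE) * se        #(32.6)
c(Lower = 2.4 - moe, Upper = 2.4 + moe)                        #(32.7)
```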

10

32.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

  • (1:) \({n}_1 = 35, {\overline{x}}_1 = 13.6, {\sigma}_1 = 5.2\)
  • (2:) \({n}_2 = 40, {\overline{x}}_2 = 10.1, {\sigma}_2 = 8.5\)
  • (32.1) \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )} = 13.6 - 10.1 = 3.5\)
  • (32.2) \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{5.2}^2}{35} + \frac{{8.5}^2}{40}} = 1.6059\)
    • sqrt(5.2^2/35 + 8.5^2/40) \(\#\mathcal{R}\)
  • (32.8) \(\text{DOF} = 65\)
    • floor({5.2^2 / 35 + 8.5^2 / 40 }^2 / {{5.2^2 / 35}^2/{35-1} + {8.5^2 / 40}^2/{40-1}}) \(\#\mathcal{R}\)
  • (32.9) \(t = \frac{(13.6 - 10.1) - 0}{1.6059} = 2.179\)
  • \({}^2\!P_{(t = 2.179)} = 2 * {}^U\!P_{(t = 2.179)} = 2 * 0.0165 = 0.033\)
    • 2 * pt(q = 2.179, df = 65, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^2\!P_{(t)} < {\alpha} \to {H_0}\) is rejected
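Exercise 10 end to end in R (variable names are ours):

```r
# #Exercise 10: two-tail t-test with unknown sigmas
se  <- sqrt(5.2^2/35 + 8.5^2/40)                               #(32.2)
t   <- ((13.6 - 10.1) - 0) / se                                #(32.9)
dof <- floor((5.2^2/35 + 8.5^2/40)^2 /
             ((5.2^2/35)^2/(35 - 1) + (8.5^2/40)^2/(40 - 1)))  #(32.8)
p2  <- 2 * pt(q = t, df = dof, lower.tail = FALSE)             # Two-Tail p-value
c(t = t, df = dof, p = p2)
```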

32.7 Matched Samples (Paired)

Suppose employees at a manufacturing company can use two different methods to perform a production task. To maximize production output, the company wants to identify the method with the smaller population mean completion time. We can use two alternative designs for the sampling procedure.

Definition 32.6 Independent sample design: A simple random sample of workers is selected and each worker in the sample uses method 1. A second independent simple random sample of workers is selected and each worker in this sample uses method 2.
Definition 32.7 Matched sample design: One simple random sample of workers is selected. Each worker first uses one method and then uses the other method. The order of the two methods is assigned randomly to the workers, with some workers performing method 1 first and others performing method 2 first. Each worker provides a pair of data values, one value for method 1 and another value for method 2.

In the matched sample design the two production methods are tested under similar conditions (i.e., with the same workers); hence this design often leads to a smaller sampling error than the independent sample design.

The primary reason is that in a matched sample design, variation between workers is eliminated because the same workers are used for both production methods.

The key to the analysis of the matched sample design is to realize that we consider only the column of differences.

The matched sample design is generally preferred to the independent sample design because the matched-sample procedure often improves the precision of the estimate.

Let \({\mu}_d\) = the mean of the difference in values for the population

Definition 32.8 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_d = 0 \iff {H_a}: {\mu}_d \neq 0\)

Sample Mean is given by (32.12) like (25.6) and Sample Standard Deviation is given by (32.13) like (25.12)

\[\overline{d} = \frac{\sum{{d}_i}}{n} \tag{32.12}\]

\[{s}_d = \sqrt{\frac{\sum ({d}_i - \overline{d})^2}{n-1}} \tag{32.13}\]

Test Statistic with \((n-1)\) degrees of freedom : Refer (32.14) like (31.3)

\[t = \frac{\overline{d} - {\mu}_d}{{s}_d/\sqrt{n}} \tag{32.14}\]

Margin of Error (\(\text{MOE}_{{\gamma}}\)) : Refer (32.15) like (32.3)

\[\text{MOE}_{{\gamma}} = {t}_{\frac{{\alpha}}{2}}\frac{{s}_d}{\sqrt{n}} \tag{32.15}\]

\(\text{Interval Estimate}_{\gamma}\) : Refer (32.16) like (32.4)

\[\text{Interval Estimate}_{\gamma} = \overline{d} \pm {t}_{\frac{{\alpha}}{2}}\frac{{s}_d}{\sqrt{n}} \tag{32.16}\]

“ForLater” - Exercise

Example

Example: Two Production Methods

  • (1: Method 1) \(\{6, 5, 7, 6.2, 6, 6.4\}\)
  • (2: Method 2) \(\{5.4, 5.2, 6.5, 5.9, 6, 5.8\}\)
  • Difference: \({d}_i =\{0.6, -0.2, 0.5, 0.3, 0, 0.6\}\)
  • \({n} = 6, {\overline{d}} = 0.3, {{s}_d} = 0.335, \text{DOF} = 5, \text{SE} = {{s}_d}/\sqrt{n} = 0.1366\)
  • (32.14) \(t = \frac{\overline{d} - {\mu}_d}{{s}_d/\sqrt{n}} = \frac{0.30 - 0}{0.335/\sqrt{6}} = 2.196\)
    • {mean(bb$d) - 0 } / {sd(bb$d) / sqrt(length(bb$d))} \(\#\mathcal{R}\)
  • Calculate \({}^2\!P_{(t)}\)
    • \({}^2\!P_{(t = 2.196)} = 2 * {}^U\!P_{(t = 2.196)} = 2 * 0.0397 = 0.07949\)
      • 2 * pt(q = 2.196, df = 5, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^2\!P_{(z)} > {\alpha} \to {H_0}\) cannot be rejected
  • 95% Confidence Interval can also be estimated
    • \({\gamma = 0.95} \iff{\alpha} = 0.05 \to {{}^2\!t_{{\alpha}/2}} = {{}^2\!t_{0.025}} = 2.57\)
    • qt(p = 0.025, df = 5, lower.tail = FALSE) \(\#\mathcal{R}\)
  • (32.15) \(\text{MOE}_{\gamma =0.95} = 2.57 * 0.1366 = 0.35\)
  • (32.16) \(\text{Interval Estimate}_{\gamma} = 0.3 \pm 0.35\)

Code

# #Matched Samples: Same workers providing data for two methods
xxMatchedMethods <- tibble(M1 = c(6, 5, 7, 6.2, 6, 6.4), 
                           M2 = c(5.4, 5.2, 6.5, 5.9, 6, 5.8))
aa <- xxMatchedMethods
# #Get Difference
bb <- aa %>% mutate(d = M1-M2)
str(bb)
## tibble [6 x 3] (S3: tbl_df/tbl/data.frame)
##  $ M1: num [1:6] 6 5 7 6.2 6 6.4
##  $ M2: num [1:6] 5.4 5.2 6.5 5.9 6 5.8
##  $ d : num [1:6] 0.6 -0.2 0.5 0.3 0 ...
paste0(round(bb[3], 1))
## [1] "c(0.6, -0.2, 0.5, 0.3, 0, 0.6)"
#
cat(paste0("- ${n} = ", length(bb$d), ", {\\overline{d}} = ", round(mean(bb$d), 1), 
           ", {{s}_d} = ", round(sd(bb$d), 3), "$\n"))
## - ${n} = 6, {\overline{d}} = 0.3, {{s}_d} = 0.335$
cat(paste0("t = ", round({mean(bb$d) - 0 } / {sd(bb$d) / sqrt(length(bb$d))}, 3), "\n"))
## t = 2.196
#
# #Paired t-test
bb <- aa %>% pivot_longer(everything(), names_to = "key", values_to = "value")
#
# #Welch Two Sample t-test
# #Alternative must be: "two.sided" (Default), "less", "greater"
bb_ha <- "two.sided"
bb_testT <- t.test(formula = value ~ key, data = bb, alternative = bb_ha, paired = TRUE)
bb_testT
## 
##  Paired t-test
## 
## data:  value by key
## t = 2.1958, df = 5, p-value = 0.07952
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.05120834  0.65120834
## sample estimates:
## mean of the differences 
##                     0.3

32.8 Population Proportions

To make an inference about the difference between the two population proportions \(({p}_1 - {p}_2)\), we select a simple random sample of \({n}_1\) units from population 1 and a second simple random sample of \({n}_2\) units from population 2. Let \(\overline{{p}}_1, \overline{{p}}_2\) denote the sample proportions for simple random sample from populations 1 and 2.

\(({x})\) denotes Count of Success

Interval Estimation of \(({p}_1 - {p}_2)\)

The point estimator of the difference between two population proportions is the difference between the sample proportions of two independent simple random samples.

Point Estimate : Refer (32.17) like (32.1)

\[E_{( {\overline{p}}_1 - {\overline{p}}_2 )} = {\overline{p}}_1 - {\overline{p}}_2 \tag{32.17}\]

As with other point estimators, the point estimator \(E_{( {\overline{p}}_1 - {\overline{p}}_2 )}\) has a standard error \({\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)}\), that describes the variation in the sampling distribution of the estimator.

The two population proportions, \(({p}_1, {p}_2)\), are unknown. Thus, sample proportions \(({\overline{p}}_1, {\overline{p}}_2)\) are being used to estimate them.

Standard Error of \(({\overline{p}}_1 - {\overline{p}}_2)\) : Refer equation (32.18) like (32.2)

\[\begin{align} {\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} &= \sqrt{\frac{{p}_1 (1-{p}_1)}{{n}_1} + \frac{{p}_2 (1-{p}_2)}{{n}_2}} \\ &= \sqrt{\frac{{\overline{p}}_1 (1-{\overline{p}}_1)}{{n}_1} + \frac{{\overline{p}}_2 (1-{\overline{p}}_2)}{{n}_2}} \end{align} \tag{32.18}\]

Margin of Error (\(\text{MOE}_{{\gamma}}\)) : Refer equation (32.19) like (32.3)

\[\text{MOE}_{{\gamma}} = {z}_{\frac{{\alpha}}{2}}{\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} = {z}_{\frac{{\alpha}}{2}}\sqrt{\frac{{\overline{p}}_1 (1-{\overline{p}}_1)}{{n}_1} + \frac{{\overline{p}}_2 (1-{\overline{p}}_2)}{{n}_2}} \tag{32.19}\]

\(\text{Interval Estimate}_{\gamma}\) : Refer equation (32.20) like (32.4)

\[\text{Interval Estimate}_{\gamma} = ({\overline{p}}_1 - {\overline{p}}_2) \pm {z}_{\frac{{\alpha}}{2}}\sqrt{\frac{{\overline{p}}_1 (1-{\overline{p}}_1)}{{n}_1} + \frac{{\overline{p}}_2 (1-{\overline{p}}_2)}{{n}_2}} \tag{32.20}\]

Example: Tax Preparation: (Count of Success \(({x})\) is Number of Returns with Errors)

  • (1: Office 1) \(\{{n}_1 = 250, {x}_1 = 35\} \to {\overline{p}}_1 = {x}_1/{n}_1 = 0.14\)
  • (2: Office 2) \(\{{n}_2 = 300, {x}_2 = 27\} \to {\overline{p}}_2 = {x}_2/{n}_2 = 0.09\)
  • point estimate
    • (32.17) \(E_{( {\overline{p}}_1 - {\overline{p}}_2 )} = 0.14 - 0.09 = 0.05\)
      • Thus, we estimate that Office 1 has a .05, or 5%, greater error rate than Office 2.
  • For \({\gamma = 0.90} \iff{\alpha} = 0.10 \to {z_{{\alpha}/2}} = {z_{0.05}} = 1.645\)
    • qnorm(p = 0.05, lower.tail = FALSE) \(\#\mathcal{R}\)
  • (32.18) \({\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} = 0.0275\)
    • sqrt(0.14 * (1 - 0.14) / 250 + 0.09 * (1 - 0.09) / 300) \(\#\mathcal{R}\)
  • (32.19) \(\text{MOE}_{\gamma} = {z}_{\frac{{\alpha}}{2}} {\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} = 1.645 * 0.0275 = 0.045\)
  • (32.20) \(\text{Interval Estimate}_{\gamma} = ({\overline{p}}_1 - {\overline{p}}_2) \pm \text{MOE}_{\gamma} = 0.05 \pm 0.045\)
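The point estimate, standard error, and margin of error above can be reproduced directly in R (a minimal sketch; object names are illustrative):

```r
# #Interval estimate of (p1 - p2) at gamma = 0.90; equations (32.17) to (32.20)
p1 <- 35 / 250                         # #Office 1 sample proportion = 0.14
p2 <- 27 / 300                         # #Office 2 sample proportion = 0.09
pointEst <- p1 - p2                    # #(32.17) point estimate = 0.05
se <- sqrt(p1 * (1 - p1) / 250 + p2 * (1 - p2) / 300)  # #(32.18) standard error
z <- qnorm(p = 0.05, lower.tail = FALSE)               # #z_{alpha/2} = 1.645
moe <- z * se                          # #(32.19) margin of error
c(pointEst - moe, pointEst + moe)      # #(32.20) interval estimate
```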

32.8.1 Hypothesis

Definition 32.9 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {p}_1 - {p}_2 \geq 0 \iff {H_a}: {p}_1 - {p}_2 < 0\)
Definition 32.10 \(\text{\{Right or Upper\} } {H_0} : {p}_1 - {p}_2 \leq 0 \iff {H_a}: {p}_1 - {p}_2 > 0\)
Definition 32.11 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {p}_1 - {p}_2 = 0 \iff {H_a}: {p}_1 - {p}_2 \neq 0\)

When we assume \({H_0}\) is true as an equality, we have \({p}_1 - {p}_2 = 0\), which is the same as saying that the population proportions are equal, \({p}_1 = {p}_2 = {p}\). The equation (32.18) becomes (32.21)

Standard Error of \(({\overline{p}}_1 - {\overline{p}}_2)\) : When \({p}_1 = {p}_2 = {p}\)

\[\begin{align} {\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} &= \sqrt{{p} (1-{p})\left(\frac{1}{{n}_1} + \frac{1}{{n}_2}\right)} \\ &= \sqrt{{\overline{p}} (1-{\overline{p}})\left(\frac{1}{{n}_1} + \frac{1}{{n}_2}\right)} \end{align} \tag{32.21}\]

With \({p}\) unknown, we pool, or combine, the sample proportions from the two samples \(({\overline{p}}_1, {\overline{p}}_2)\) to obtain a single point estimator of \({p}\), given by (32.22)

Pooled Estimator of \({p}\) : When \({p}_1 = {p}_2 = {p}\)

\[{\overline{p}} = \frac{{n}_1 {\overline{p}}_1 + {n}_2 {\overline{p}}_2}{{n}_1 + {n}_2} \tag{32.22}\]

Test Statistic for Hypothesis Tests :

\[z = \frac{({\overline{p}}_1 - {\overline{p}}_2)}{{\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)}} = \frac{({\overline{p}}_1 - {\overline{p}}_2)}{\sqrt{{\overline{p}} (1-{\overline{p}})\left(\frac{1}{{n}_1} + \frac{1}{{n}_2}\right)}} \tag{32.23}\]

Example: Tax Preparation: Continuation

32.11 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {p}_1 - {p}_2 = 0 \iff {H_a}: {p}_1 - {p}_2 \neq 0\)

  • (32.22) \({\overline{p}} = \frac{{n}_1 {\overline{p}}_1+ {n}_2 {\overline{p}}_2}{{n}_1 + {n}_2} = \frac{250 * 0.14 + 300 * 0.09}{250 + 300} = 0.1127\)
    • {250 * 0.14 + 300 * 0.09} / {250 + 300} \(\#\mathcal{R}\)
  • (32.23) \(z = \frac{({\overline{p}}_1 - {\overline{p}}_2)}{{\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)}} = \frac{({\overline{p}}_1 - {\overline{p}}_2)}{\sqrt{{\overline{p}} (1-{\overline{p}})\left(\frac{1}{{n}_1} + \frac{1}{{n}_2}\right)}}\)
    • \(z = \frac{(0.14 - 0.09)}{\sqrt{0.1127 (1-0.1127)\left(\frac{1}{250} + \frac{1}{300}\right)}} = \frac{(0.14 - 0.09)}{0.0271} = 1.845\)
    • NOTE: The denominator calculated in this manner \((0.0271)\) is very close to the originally calculated \({\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} = 0.0275\)
  • Calculate \({}^2\!P_{(z)}\)
    • \({}^2\!P_{(z = 1.845)} = 2 * {}^U\!P_{(z = 1.845)} = 2 * 0.0325 = 0.065\)
      • 2 * pnorm(q = 1.845, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.10\)
    • \({}^2\!P_{(z)} < {\alpha} \to {H_0}\) is rejected i.e. the proportions are different
    • The firm can conclude that the error rates differ between the two offices.
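The pooled estimator and z statistic above can be checked with a few lines of R (a minimal sketch; object names are illustrative):

```r
# #Pooled two-proportions z-test; equations (32.22) and (32.23)
n1 <- 250; x1 <- 35                    # #Office 1
n2 <- 300; x2 <- 27                    # #Office 2
p1 <- x1 / n1; p2 <- x2 / n2           # #Sample proportions 0.14 and 0.09
pPool <- (n1 * p1 + n2 * p2) / (n1 + n2)                        # #(32.22) = 0.1127
z <- (p1 - p2) / sqrt(pPool * (1 - pPool) * (1 / n1 + 1 / n2))  # #(32.23)
pVal <- 2 * pnorm(q = z, lower.tail = FALSE)                    # #Two-tail p-value
```

Note that \(z^2\) equals the X-squared statistic reported by prop.test() in the Code section below.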

Code

(External) Two-Proportions Z-Test

bb_ha <- "two.sided"
bb_gamma <- 0.90
bb_propT <- prop.test(x = c(35, 27), n = c(250, 300), alternative = bb_ha, 
                      conf.level = bb_gamma, correct = FALSE)
bb_propT
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  c(35, 27) out of c(250, 300)
## X-squared = 3.4084, df = 1, p-value = 0.06486
## alternative hypothesis: two.sided
## 90 percent confidence interval:
##  0.004815898 0.095184102
## sample estimates:
## prop 1 prop 2 
##   0.14   0.09
names(bb_propT)
## [1] "statistic"   "parameter"   "p.value"     "estimate"    "null.value"  "conf.int"   
## [7] "alternative" "method"      "data.name"
#
# #X-squared is the square of the calculated z-value
cat(paste0("X-squared is the squared z statistic (X-squared = ", round(bb_propT$statistic, 6), ")\n"))
## X-squared is the squared z statistic (X-squared = 3.408415)
#
# #By default, the function prop.test() uses the Yates continuity correction
# #It is important if either the expected successes or failures is < 5. 
# #If you do not want the correction, use the additional argument correct = FALSE. 
# #i.e. To make the test mathematically equivalent to the uncorrected z-test of a proportion.
cat(paste0("p-value is the observed significance level of the test (p-value = ", 
           round(bb_propT$p.value, 6), ")\n"))
## p-value is the observed significance level of the test (p-value = 0.064865)
cat(paste0("conf.int is the confidence interval for the probability of success at ", bb_gamma,
           " level (conf.int = [", paste0(round(bb_propT$conf.int, 3), collapse = ", "), "])\n"))
## conf.int is the confidence interval for the probability of success at 0.9 level (conf.int = [0.005, 0.095])
cat(paste0("sample estimates are the estimated probabilities of success. p: ", 
           paste0(round(bb_propT$estimate, 2), collapse = ", "), "\n"))
## sample estimates are the estimated probabilities of success. p: 0.14, 0.09

Validation


33 Variance

33.1 Overview

  • “Inferences About Population Variances”
    • “ForLater” - Hypothesis Testing, Inferences About Two Population Variances

33.2 Inferences About a Population Variance

In many manufacturing applications, controlling the process variance is extremely important in maintaining quality.

The sample variance \({s^2}\), given by equation (33.1), is the point estimator of the population variance \({\sigma}^2\).

\[{s^2} = \frac{\sum {({x}_i - {\overline{x}})}^2}{n-1} \tag{33.1}\]

Definition 33.1 Whenever a simple random sample of size \({n}\) is selected from a normal population, the sampling distribution of \(\frac{(n-1)s^2}{{\sigma}^2}\) is a chi-square distribution with \({n − 1}\) degrees of freedom.

Note:

  • The chi-square distribution is based on sampling from a normal population.
  • It can be used to develop interval estimates and conduct hypothesis tests about a population variance.
  • The notation \({\chi_{\alpha}^2}\) denotes the value for the chi-square distribution that provides an area or probability of \({\alpha}\) to the right of the \({\chi_{\alpha}^2}\) value.

33.2.1 Interval Estimation

Example: A sample of 20 containers \({n = 20}\) has the sample variance \({s^2} = 0.0025\)

  • \({\chi_{\alpha = 0.025}^2} = 32.852\)
    • qchisq(p = 0.025, df = 19, lower.tail = FALSE) \(\#\mathcal{R}\)
    • It indicates that 2.5% of the chi-square values are to the right of 32.852
    • Also, \({\chi_{\alpha = 0.975}^2} = 8.907\) indicates that 97.5% of the chi-square values are to the right of 8.907.
      • qchisq(p = 0.975, df = 19, lower.tail = FALSE) \(\#\mathcal{R}\)
    • Thus, 95% of the chi-square values are between \({\chi_{\alpha = 0.975}^2}\) and \({\chi_{\alpha = 0.025}^2}\)
    • There is a .95 probability of obtaining a \({\chi^2}\) value such that \({\chi_{0.975}^2} \leq {\chi^2} \leq{\chi_{0.025}^2}\)
# #pnorm() qnorm() | pt() qt() | pchisq() qchisq() | pf() qf() 
#
# #p-value approach: Find Cumulative Probability P corresponding to the given ChiSq & DOF=19
pchisq(q = 32.852, df = 19, lower.tail = FALSE)
## [1] 0.02500216
#
# #ChiSq value for which Area under the curve towards Right is alpha=0.025 & DOF=19 #32.852
qchisq(p = 0.025, df = 19, lower.tail = FALSE)
## [1] 32.85233
  • Using Definition 33.1 and the chi-square interval above, we can get (33.2), which provides a 95% confidence interval estimate for the population variance \({\sigma}^2\).

\[\frac{(n-1)s^2}{{\chi_{0.025}^2}} \leq {\sigma}^2 \leq \frac{(n-1)s^2}{{\chi_{0.975}^2}} \tag{33.2}\]

  • In the example, \((n-1)s^2 = 19 * 0.0025 = 0.0475\)
  • (33.2) \(\frac{0.0475}{32.852} \leq {\sigma}^2 \leq \frac{0.0475}{8.907} \to 0.0014 \leq {\sigma}^2 \leq 0.0053 \to 0.0380 \leq {\sigma} \leq 0.0730\)
    • which gives the 95% confidence interval for the population standard deviation

Generalising the equation (33.2), the equation (33.3) is the interval estimate of a population variance.

\[\frac{(n-1)s^2}{{\chi_{{\alpha}/2}^2}} \leq {\sigma}^2 \leq \frac{(n-1)s^2}{{\chi_{1-{\alpha}/2}^2}} \tag{33.3}\]

where the \({\chi^2}\) values are based on a chi-square distribution with \({n-1}\) degrees of freedom and where \((1 − {\alpha})\) is the confidence coefficient.
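The general interval (33.3) is easy to wrap in a small helper; var_ci below is an illustrative name, not a function from the text:

```r
# #Interval estimate of a population variance; equation (33.3)
# #var_ci is an illustrative helper name, not from the text
var_ci <- function(n, s2, gamma = 0.95) {
  alpha <- 1 - gamma
  lower <- (n - 1) * s2 / qchisq(p = alpha / 2, df = n - 1, lower.tail = FALSE)
  upper <- (n - 1) * s2 / qchisq(p = 1 - alpha / 2, df = n - 1, lower.tail = FALSE)
  c(lower = lower, upper = upper)
}
var_ci(n = 20, s2 = 0.0025)            # #Container example: about [0.0014, 0.0053]
```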

33.2.2 Hypothesis Tests

Using \({{\sigma}_0^2}\) to denote the hypothesized value for the population variance, the three forms for a hypothesis test are as follows:

Definition 33.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\sigma}^2 \geq {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 < {{\sigma}_0^2}\)
Definition 33.3 \(\text{\{Right or Upper\} } {H_0} : {\sigma}^2 \leq {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 > {{\sigma}_0^2}\)
Definition 33.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}^2 = {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 \neq {{\sigma}_0^2}\)

Note: In general, Upper Tail test is the most frequently observed test because low variances are generally desirable. With a statement about the maximum allowable population variance, we can test the null hypothesis that the population variance is less than or equal to the maximum allowable value against the alternative hypothesis that the population variance is greater than the maximum allowable value. With this test structure, corrective action will be taken whenever rejection of the null hypothesis indicates the presence of an excessive population variance.

Test Statistic for Hypothesis Tests About a Population Variance: Refer (33.4), where \({\chi^2}\) has a chi-square distribution with \({n - 1}\) degrees of freedom.

\[{\chi^2} = \frac{(n - 1){s}^2}{{\sigma}_0^2} \tag{33.4}\]

Example: Louis: the company standard specifies an arrival time variance of 4 or less

33.3 \(\text{\{Right or Upper\} } {H_0} : {\sigma}^2 \leq {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 > {{\sigma}_0^2}\)

  • (Sample) \({n} = 24, {s}^2 = 4.9\)
  • (33.4) \({\chi^2} = \frac{(n - 1){s}^2}{{\sigma}_0^2} = \frac{(24 - 1) * 4.9}{4} = 28.18\)
  • \({}^U\!P_{(\chi^2)} = 0.209\)
    • pchisq(q = 28.18, df = 23, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^U\!P_{(\chi^2)} > {\alpha} \to {H_0}\) cannot be rejected
    • The sample results do not provide sufficient evidence to conclude that the arrival time variance exceeds the company standard of 4.
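The upper tail test above can be verified in R (a minimal sketch; variable names are illustrative):

```r
# #Upper tail test of a population variance; equation (33.4)
n <- 24; s2 <- 4.9; sigma0sq <- 4
chisq <- (n - 1) * s2 / sigma0sq       # #Test statistic = 28.175
pUpper <- pchisq(q = chisq, df = n - 1, lower.tail = FALSE)  # #About 0.21
```

Since pUpper exceeds \({\alpha} = 0.05\), \({H_0}\) cannot be rejected, matching the conclusion above.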

Example: bureau of motor vehicles: Evaluate the variance in the new examination test scores with the historical value \({\sigma}_0^2 = 100\)

33.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}^2 = {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 \neq {{\sigma}_0^2}\)

  • (Sample) \({n} = 30, {s}^2 = 162\)
  • (33.4) \({\chi^2} = \frac{(n - 1){s}^2}{{\sigma}_0^2} = \frac{(30 - 1) * 162}{100} = 46.98\)
  • \({}^2\!P_{(\chi^2)} = 2 * {}^U\!P_{(\chi^2)} = 2 * 0.0187 = 0.0374\)
    • 2 * pchisq(q = 46.98, df = 29, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^2\!P_{(\chi^2)} < {\alpha} \to {H_0}\) is rejected i.e. the change is significant
    • The new examination test scores have a population variance different from the historical variance
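Likewise, the two tail test can be verified in R (a minimal sketch; variable names are illustrative):

```r
# #Two tail test of a population variance; equation (33.4)
n <- 30; s2 <- 162; sigma0sq <- 100
chisq <- (n - 1) * s2 / sigma0sq       # #Test statistic = 46.98
pTwoTail <- 2 * pchisq(q = chisq, df = n - 1, lower.tail = FALSE)
```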

33.3 Inferences About Two Population Variances

The two sample variances \({s}_1^2\) and \({s}_2^2\) will be the basis for making inferences about the two population variances \({\sigma}_1^2\) and \({\sigma}_2^2\).

Definition 33.5 Whenever independent simple random samples of sizes \({n}_1\) and \({n}_2\) are selected from two normal populations with equal variances \(({\sigma}_1^2 = {\sigma}_2^2)\), the sampling distribution of \(\frac{{s}_1^2}{{s}_2^2}\) is an F distribution with \(({n}_1 - 1)\) degrees of freedom for the numerator and \(({n}_2 - 1)\) degrees of freedom for the denominator.

Note:

  • The F distribution is based on sampling from two normal populations.
  • The F distribution is not symmetric, and the F values can never be negative.
    • The shape of any particular F distribution depends on its numerator and denominator degrees of freedom.
  • We refer to the population providing the larger sample variance as population 1.
    • Because the F test statistic is constructed with the larger sample variance \({s}_1^2\) in the numerator, the value of the test statistic will be in the upper tail of the F distribution.

Test Statistic for Hypothesis Tests About Population Variances with \(({\sigma}_1^2 = {\sigma}_2^2)\) : Refer equation (33.5)

\[F = \frac{{s}_1^2}{{s}_2^2} \tag{33.5}\]

Hypothesis Tests :

Definition 33.6 \(\text{\{Left or Lower \} }\space\thinspace \text{Do not do this.}\)
Definition 33.7 \(\text{\{Right or Upper\} } {H_0} : {\sigma}_1^2 \leq {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 > {\sigma}_2^2\)
Definition 33.8 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}_1^2 = {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 \neq {\sigma}_2^2\)

Example: Dullus County Schools:

33.8 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}_1^2 = {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 \neq {\sigma}_2^2\)

  • (1: Milbank) \({n}_1 = 26, {s}_1^2 = 48\)
  • (2: Gulf) \({n}_2 = 16, {s}_2^2 = 20\)
  • (33.5) \(F = \frac{{s}_1^2}{{s}_2^2} = \frac{48}{20} = 2.4\)
  • \({}^2\!P_{(F)} = 2 * {}^U\!P_{(F)} = 2 * 0.0406 = 0.0812\)
    • 2 * pf(q = 2.4, df1 = 25, df2 = 15, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.10\)
    • \({}^2\!P_{(F)} < {\alpha} \to {H_0}\) is rejected i.e. the two populations have different variances
    • The sample results provide sufficient evidence to conclude that the variances are different.
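The F statistic and two-tail p-value can be computed directly (a minimal sketch; with raw data, the built-in var.test() performs the same F test):

```r
# #Two-tail F test for equality of two population variances; equation (33.5)
# #Population 1 (Milbank) has the larger sample variance
n1 <- 26; s1sq <- 48
n2 <- 16; s2sq <- 20
Fstat <- s1sq / s2sq                   # #2.4
pTwoTail <- 2 * pf(q = Fstat, df1 = n1 - 1, df2 = n2 - 1, lower.tail = FALSE)
```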

Example: public opinion survey: do women show a greater variation in attitude on political issues than men

33.7 \(\text{\{Right or Upper\} } {H_0} : {\sigma}_1^2 \leq {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 > {\sigma}_2^2\)

  • (1: Women) \({n}_1 = 41, {s}_1^2 = 120\)
  • (2: Men) \({n}_2 = 31, {s}_2^2 = 80\)
  • (33.5) \(F = \frac{{s}_1^2}{{s}_2^2} = \frac{120}{80} = 1.5\)
  • \({}^U\!P_{(F)} = 0.1256\)
    • pf(q = 1.5, df1 = 40, df2 = 30, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^U\!P_{(F)} > {\alpha} \to {H_0}\) cannot be rejected i.e. we cannot conclude that the variances differ
    • The sample results do not provide sufficient evidence to conclude that women show greater variation in attitude on political issues than men

Validation


34 Independence

34.1 Overview

  • “Comparing Multiple Proportions, Test of Independence and Goodness of Fit”
    • “ForLater” - Everything

34.2 Introduction

Hypothesis-testing procedures that expand our capacity for making statistical inferences about populations

  • The test statistic used in conducting the hypothesis tests in this chapter is based on the chi-square \({\chi^2}\) distribution.
  • In all cases, the data are categorical.
  • Applications
    • Testing the equality of population proportions for three or more populations
    • Testing the independence of two categorical variables
    • Testing whether a probability distribution for a population follows a specific historical or theoretical probability distribution

34.3 Testing the Equality of Population Proportions for Three or More Populations

Definition 34.1 \(\text{\{Equality of Population Proportions \}} {H_0} : {p}_1 = {p}_2 = \dots = {p}_k \iff {H_a}: \text{Not all population proportions are equal}\)

where \({p}_j\) is population proportion of the \(j^{\text{th}}\) population. We assume that a simple random sample of size \({n}_j\) has been selected from each of the \({k}\) populations or treatments.

Select a random sample from each of the populations and record the observed frequencies, \(f_{ij}\), in a table with 2 rows and k columns.

Expected Frequencies Under the Assumption \({H_0}\) is true : Refer equation (34.1)

\[e_{ij} = \frac{(\text{Row } i \text{ Total})(\text{Column } j \text{ Total})}{\text{Total Sample Size}} \tag{34.1}\]

Chi-Square Test Statistic : Refer equation (34.2)

\[{\chi^2} = \sum_{i}{\sum_{j}{\frac{(f_{ij} - e_{ij})^2}{e_{ij}}}} \tag{34.2}\]

Where:

\[\begin{align} f_{ij} &= \text{observed frequency for the cell in row } i \text{ and column } j \\ e_{ij} &= \text{expected frequency for the cell in row } i \text{ and column } j \end{align}\]

Note: In a chi-square test involving the equality of \({k}\) population proportions, the above test statistic has a chi-square distribution with \({k - 1}\) degrees of freedom provided the expected frequency is 5 or more for each cell.

A chi-square test for equal population proportions will always be an upper tail test with rejection of \({H_0}\) occurring when the test statistic is in the upper tail of the chi-square distribution.

In studies such as these, we often use the same sample size for each population. We have chosen different sample sizes in this example to show that the chi-square test is not restricted to equal sample sizes for each of the k populations.

“ForLater” - Creating the ChiSq Table

Example: JD Power: Compare customer loyalty for three automobiles by using the proportion of owners likely to repurchase a particular automobile

  • Count of Success \(({x})\) is Number of Owners likely to repurchase
  • (1: Impala) \(\{{n}_1 = 125, {x}_1 = 69\}\)
  • (2: Fusion) \(\{{n}_2 = 200, {x}_2 = 120\}\)
  • (3: Accord) \(\{{n}_3 = 175, {x}_3 = 123\}\)
  • (34.2) \({\chi^2} = 7.89\)
  • \({}^U\!P_{(\chi^2)} = 0.0193\)
    • pchisq(q = 7.89, df = 2, lower.tail = FALSE) \(\#\mathcal{R}\)
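prop.test() also handles three or more samples, giving this chi-square test for equal proportions directly (a minimal sketch; the Yates correction is only applied in the two-sample case):

```r
# #Equality of three population proportions; JD Power repurchase data
bb_prop3 <- prop.test(x = c(69, 120, 123), n = c(125, 200, 175))
bb_prop3$statistic                     # #Chi-square test statistic, df = 2
bb_prop3$p.value
```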

“ForLater” - Marascuilo procedure

34.4 Test of Independence

An important application of a chi-square test involves using sample data to test for the independence of two categorical variables. The null hypothesis for this test is that the two categorical variables are independent. Thus, the test is referred to as a test of independence.

Example: Beer: Preference vs. gender

  • Since an objective of the study is to determine if there is difference between the beer preferences for male and female beer drinkers, we consider gender an explanatory variable and follow the usual practice of making the explanatory variable the column variable in the data tabulation table.
  • The beer preference is the categorical response variable and is shown as the row variable.
  • “ForLater”

34.5 Goodness of Fit Test

  • “ForLater”

34.6 Summary

All tests apply to categorical variables and all tests use a chi-square \({\chi^2}\) test statistic that is based on the differences between observed frequencies and expected frequencies. In each case, expected frequencies are computed under the assumption that the null hypothesis is true. These chi-square tests are upper tailed tests. Large differences between observed and expected frequencies provide a large value for the chi-square test statistic and indicate that the null hypothesis should be rejected.

The test for the equality of population proportions for three or more populations is based on independent random samples selected from each of the populations. The sample data show the counts for each of two categorical responses for each population. The null hypothesis is that the population proportions are equal. Rejection of the null hypothesis supports the conclusion that the population proportions are not all equal.

The test of independence between two categorical variables uses one sample from a population with the data showing the counts for each combination of two categorical variables. The null hypothesis is that the two variables are independent and the test is referred to as a test of independence. If the null hypothesis is rejected, there is statistical evidence of an association or dependency between the two variables.

The goodness of fit test is used to test the hypothesis that a population has a specific historical or theoretical probability distribution. We showed applications for populations with a multinomial probability distribution and with a normal probability distribution. Since the normal probability distribution applies to continuous data, intervals of data values were established to create the categories for the categorical variable required for the goodness of fit test.

Validation


35 ANOVA

35.1 Overview

  • “Experimental Design and Analysis of Variance”
    • “ForLater” - Exercises, Fisher LSD Onwards

35.2 Introduction

Example: Chemitech: Comparison of three methods of assembly A, B, C in terms of most assemblies per week

  • In this experiment
    • assembly method is the independent variable or factor.
    • Because three assembly methods correspond to this factor, we say that three treatments are associated with this experiment; each treatment corresponds to one of the three assembly methods.
      • The three assembly methods or treatments define the three populations of interest
    • This is an example of a single-factor experiment; it involves one categorical factor (method of assembly)
    • For each population the dependent or response variable is the number of filtration systems assembled per week, and the primary statistical objective of the experiment is to determine whether the mean number of units produced per week is the same for all three populations (methods).
  • Suppose a random sample of three employees is selected from all assembly workers.
    • The three randomly selected workers are the experimental units.
    • The experimental design that we will use is called a completely randomized design. This type of design requires that each of the three assembly methods or treatments be assigned randomly to one of the experimental units or workers.
Definition 35.1 Randomization is the process of assigning the treatments to the experimental units at random.
  • Suppose that instead of selecting just three workers at random we selected 15 workers and then randomly assigned each of the three treatments to 5 of the workers.
    • Because each method of assembly is assigned to 5 workers, we say that five replicates have been obtained.
  • As given in the data, sample means for A, B, C are : \(\{{\overline{x}}_1 = 62, {\overline{x}}_2 = 66, {\overline{x}}_3 = 52\}\)
    • From these data, method B appears to result in higher production rates than either of the other methods.
xxChemitech <- tibble(A = c(58, 64, 55, 66, 67), 
                      B = c(58, 69, 71, 64, 68), 
                      C = c(48, 57, 59, 47, 49))
aa <- xxChemitech
# #Summary
aa %>% 
    pivot_longer(everything(), names_to = "key", values_to = "value") %>% 
    group_by(key) %>% 
    summarise(across(value, 
                     list(Count = length, Mean = mean, SD = sd, Variance = var), 
                     .names = "{.fn}"))
## # A tibble: 3 x 5
##   key   Count  Mean    SD Variance
##   <chr> <int> <dbl> <dbl>    <dbl>
## 1 A         5    62  5.24     27.5
## 2 B         5    66  5.15     26.5
## 3 C         5    52  5.57     31

35.3 Hypothesis

  • The real issue is whether the three sample means observed are different enough for us to conclude that the means of the populations corresponding to the three methods of assembly are different.
    • Let \(\{{\mu}_1, {\mu}_2, {\mu}_3\}\) denote mean number of units produced per week using methods A, B and C
    • we want to use the sample means to test the following hypotheses:

35.2 \(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)

If \({H_0}\) is rejected, we cannot conclude that all population means are different. Rejecting \({H_0}\) means that at least two population means have different values.

35.4 Assumptions for Analysis of Variance

Three assumptions are required to use analysis of variance.

  1. For each population, the response variable is normally distributed.
    • Implication: In the example, the number of units produced per week (response variable) must be normally distributed for each assembly method.
  2. The variance of the response variable, \({\sigma}^2\), is the same for all of the populations.
    • Implication: In the example, the variance of the number of units produced per week must be the same for each assembly method.
  3. The observations must be independent.
    • Implication: In the example, the number of units produced per week for each employee must be independent of the number of units produced per week for any other employee.

If the sample sizes are equal, analysis of variance is not sensitive to departures from the assumption of normally distributed populations.

Normality:

  • In ANOVA, the entire response column is typically nonnormal because the different groups in the data have different means.

35.5 Conceptual Overview

If the means for the three populations are equal, we would expect the three sample means to be close together. The more the sample means differ, the stronger the evidence we have for the conclusion that the population means differ. In other words, if the variability among the sample means is “small,” it supports \({H_0}\); if the variability among the sample means is “large,” it supports \({H_a}\).

If the null hypothesis is true, we can use the variability among the sample means to develop an estimate of \({\sigma}^2\).

First, note that if the assumptions for analysis of variance are satisfied and the null hypothesis is true, each sample will have come from the same normal distribution with mean \({\mu}\) and variance \({\sigma}^2\).

Recall that the sampling distribution of the sample mean \({\overline{x}}\) for a simple random sample of size \({n}\) from a normal population will be normally distributed with mean \({\mu}\) and variance \({\sigma}_{{\overline{x}}}^2 = \frac{{\sigma}^2}{n}\).

In this case, the mean and variance of the three sample mean values \(\{{\overline{x}}_1 = 62, {\overline{x}}_2 = 66, {\overline{x}}_3 = 52\}\) can be used to estimate the mean and variance of the sampling distribution.

When the sample sizes are equal, as in this example, the best estimate of the mean of the sampling distribution of \({\overline{x}}\) is the mean or average of the sample means.

In this example, an estimate of the mean of the sampling distribution of \({\overline{x}}\) is \((62 + 66 + 52)/3 = 60\). We refer to this estimate as the overall sample mean. Refer equation (35.5)

An estimate of the variance of the sampling distribution of \({\overline{x}}\), \({\sigma}_{{\overline{x}}}^2\), is provided by the variance of the three sample means.

\[{s}_{\overline{x}}^2 = \frac{(62 - 60)^2 + (66 - 60)^2 + (52 - 60)^2}{3 - 1} = 52\]

Because \({\sigma}^2 = n {\sigma}_{{\overline{x}}}^2\), the estimate can be given by

\[E_{{\sigma}^2} = n E_{{\sigma}_{{\overline{x}}}^2} = n {s}_{\overline{x}}^2 = 5 * 52 = 260\]

The quantity \(n {s}_{\overline{x}}^2\) is referred to as the between-treatments estimate of \({\sigma}^2\). It is based on the assumption that the null hypothesis is true. In this case, each sample comes from the same population, and there is only one sampling distribution of \({\overline{x}}\).

In contrast, when the population means are not equal, the between-treatments estimate will overestimate the population variance \({\sigma}^2\).

The variation within each of the samples also has an effect on the conclusion we reach in analysis of variance. When a simple random sample is selected from each population, each of the sample variances provides an unbiased estimate of \({\sigma}^2\). Hence, we can combine or pool the individual estimates of \({\sigma}^2\) into one overall estimate. The estimate of \({\sigma}^2\) obtained in this way is called the pooled or within-treatments estimate of \({\sigma}^2\).

Because each sample variance provides an estimate of \({\sigma}^2\) based only on the variation within each sample, the within-treatments estimate of \({\sigma}^2\) is not affected by whether the population means are equal. When the sample sizes are equal, the within-treatments estimate of \({\sigma}^2\) can be obtained by computing the average of the individual sample variances \(\{27.5, 26.5, 31\}\).

For this example we obtain:

\[\text{Within-treatments estimate of } {\sigma}^2 = \frac{27.5 + 26.5 + 31}{3} = 28.33\]

Remember that the between-treatments approach provides a good estimate of \({\sigma}^2\) only if the null hypothesis is true; if the null hypothesis is false, the between-treatments approach overestimates \({\sigma}^2\). The within-treatments approach provides a good estimate of \({\sigma}^2\) in either case.

Thus, if the null hypothesis is true, the two estimates will be similar and their ratio will be close to 1.

If the null hypothesis is false, the between-treatments estimate will be larger than the within-treatments estimate, and their ratio will be large.
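The between- and within-treatments estimates above can be reproduced in base R (a minimal sketch; object names are illustrative):

```r
# #Between- vs within-treatments estimates of sigma^2 (Chemitech samples)
aa <- list(A = c(58, 64, 55, 66, 67), 
           B = c(58, 69, 71, 64, 68), 
           C = c(48, 57, 59, 47, 49))
n <- 5                                 # #Replicates per treatment
xbars <- sapply(aa, mean)              # #62, 66, 52
between <- n * var(xbars)              # #n * s_xbar^2 = 5 * 52 = 260
within <- mean(sapply(aa, var))        # #(27.5 + 26.5 + 31) / 3 = 28.33
between / within                       # #Ratio well above 1 suggests H0 is false
```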

35.6 ANOVA

Analysis of Variance and the Completely Randomized Design

Definition 35.2 \(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)

where \({\mu}_j\) is mean of the \(j^{\text{th}}\) population. We assume that a simple random sample of size \({n}_j\) has been selected from each of the \({k}\) populations or treatments. For the resulting sample data, let

\[\begin{align} {x}_{ij} &= \text{value of observation } i \text{ for treatment } j \\ {n}_{j} &= \text{number of observations for treatment } j \\ {\overline{x}}_{j} &= \text{sample mean for treatment } j \\ {s}_{j}^2 &= \text{sample variance for treatment } j \\ {s}_{j} &= \text{sample standard deviation for treatment } j \end{align}\]

The formulas for the sample mean and sample variance for treatment \({j}\) are given in equations (35.1) and (35.2)

\[{\overline{x}}_j = \frac{\sum_{i=1}^{n_j}{x}_{ij}}{{n}_j} \tag{35.1}\]

\[{s}_j^2 = \frac{\sum_{i=1}^{n_j}{\left({x}_{ij} - {\overline{x}}_j\right)^2}}{{n}_j - 1} \tag{35.2}\]

The overall sample mean, denoted \({\overline{\overline{x}}}\), is the sum of all the observations divided by the total number of observations.

\[{\bar{\bar{x}}} = \frac{\sum_{j=1}^k{\sum_{i=1}^{{n}_j}{{x}_{ij}}}}{{n}_T} \tag{35.3}\]

Where

\[{n}_T = {n}_1 + {n}_2 + \cdots + {n}_k \tag{35.4}\]

If the size of each sample is \({n}\), equation (35.4) becomes \({n}_T = kn\) and equation (35.3) reduces to (35.5)

\[{\bar{\bar{x}}} = \frac{\sum_{j=1}^k{{\overline{x}}_{j}}}{k} \tag{35.5}\]

Thus, whenever the sample sizes are the same, the overall sample mean is just the average of the \({k}\) sample means.

Thus, in the example, from (35.5), \({\bar{\bar{x}}} = \frac{62 + 66 + 52}{3} = 60\)

35.7 MSTR

Between-Treatments Estimate of Population Variance

The between-treatments estimate of \({\sigma}^2\) is called the mean square due to treatments and is denoted \(\text{MSTR}\). Refer equation (35.6)

\[\text{MSTR} = \frac{\text{SSTR}}{k - 1} = \frac{\sum_{j=1}^{k}{n}_j\left({\overline{x}}_j - {\bar{\bar{x}}} \right)^2}{k - 1} \tag{35.6}\]

The numerator in equation (35.6) is called the sum of squares due to treatments and is denoted \(\text{SSTR}\). The denominator, \({k − 1}\), represents the degrees of freedom associated with SSTR. Refer equation (35.7)

\[\text{SSTR} = \sum_{j=1}^{k}{n}_j\left({\overline{x}}_j - {\bar{\bar{x}}} \right)^2 \tag{35.7}\]

If \({H_0}\) is true, MSTR provides an unbiased estimate of \({\sigma}^2\). However, if the means of the \({k}\) populations are not equal, MSTR is not an unbiased estimate of \({\sigma}^2\); in fact, in that case, MSTR should overestimate \({\sigma}^2\).

In the example:

  • (35.7) \(\text{SSTR} = 5(62 - 60)^2 + 5(66 - 60)^2 + 5(52 - 60)^2 = 520\)
  • (35.6) \(\text{MSTR} = \frac{520}{3 - 1} = 260\)
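Equations (35.6) and (35.7) can be checked in R using the sample means from the example:

# #Between-treatments estimate (SSTR, MSTR)
xbar <- c(A = 62, B = 66, C = 52)  # sample means
n_j <- 5                           # common sample size
k <- length(xbar)
xbarbar <- mean(xbar)              # overall mean = 60
SSTR <- sum(n_j * (xbar - xbarbar)^2)
SSTR
## [1] 520
MSTR <- SSTR / (k - 1)
MSTR
## [1] 260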

35.8 MSE

Within-Treatments Estimate of Population Variance

The within-treatments estimate of \({\sigma}^2\) is called the mean square due to error and is denoted \(\text{MSE}\). Refer equation (35.8)

\[\text{MSE} = \frac{\text{SSE}}{{n}_T - k} = \frac{\sum_{j=1}^{k}{({n}_j - 1){s}_j^2}}{{n}_T - k} \tag{35.8}\]

The numerator in equation (35.8) is called the sum of squares due to error and is denoted \(\text{SSE}\). The denominator, \({{n}_T - k}\) is referred to as the degrees of freedom associated with SSE. Refer equation (35.9)

\[\text{SSE} = \sum_{j=1}^{k}{({n}_j - 1){s}_j^2} \tag{35.9}\]

In the example:

  • (35.9) \(\text{SSE} = (5 - 1)27.5 + (5 - 1)26.5 + (5 - 1)31 = 340\)
  • (35.8) \(\text{MSE} = \frac{340}{15 - 3} = 28.33\)
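Equations (35.8) and (35.9) can be reproduced the same way, using the sample variances from the example:

# #Within-treatments estimate (SSE, MSE)
s2 <- c(27.5, 26.5, 31)               # sample variances
n_j <- c(5, 5, 5)
SSE <- sum((n_j - 1) * s2)
SSE
## [1] 340
MSE <- SSE / (sum(n_j) - length(s2))  # n_T - k = 15 - 3 = 12
MSE
## [1] 28.33333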

35.9 F test

Comparing the Variance Estimates

If the null hypothesis is true, \(\text{MSTR}\) and \(\text{MSE}\) provide two independent, unbiased estimates of \({\sigma}^2\).

Refer Variance.

We know that for normal populations, the sampling distribution of the ratio of two independent estimates of \({\sigma}^2\) follows an F distribution. Hence, if the null hypothesis is true and the ANOVA assumptions are valid, the sampling distribution of \(\frac{\text{MSTR}}{\text{MSE}}\) is an F distribution with numerator degrees of freedom equal to \({k - 1}\) and denominator degrees of freedom equal to \({{n}_T - k}\).

In other words, if the null hypothesis is true, the value of MSTR/MSE should appear to have been selected from this F distribution. However, if the null hypothesis is false, the value of \(\frac{\text{MSTR}}{\text{MSE}}\) will be inflated because MSTR overestimates \({\sigma}^2\). Hence, we will reject \({H_0}\) if the resulting value of \(\frac{\text{MSTR}}{\text{MSE}}\) appears to be too large to have been selected from an F distribution with \({k - 1}\) numerator degrees of freedom and \({{n}_T - k}\) denominator degrees of freedom.

Because the decision to reject \({H_0}\) is based on the value of \(\frac{\text{MSTR}}{\text{MSE}}\), the test statistic used to test for the equality of \({k}\) population means is given by equation (35.10)

Test Statistic for the Equality of \({k}\) Population Means :

\[F = \frac{\text{MSTR}}{\text{MSE}} \tag{35.10}\]

Because we will only reject the null hypothesis for large values of the test statistic, the p-value is the upper tail area of the F distribution to the right of the test statistic \({F}\).

  • aov()
    • Terms
      • Df degrees of freedom
        • for the independent variable (levels - 1)
        • and for the residuals (total observations - 1 - levels)
      • Sum Sq shows the sum of squares
        • SSTR: variation of the group means about the overall mean
        • SSE: variation of the observations within each group
      • Mean Sq shows the mean of the sum of squares
        • MSTR (Between): sum of squares / degrees of freedom for each parameter
        • MSE (Within): mean square of the residuals
      • F-value is the test statistic from the F test.
        • Mean square of each independent variable / mean square of the residuals.
        • The larger the F value, the more likely it is that the variation caused by the independent variable is real and not due to chance.
      • Pr(>F) is the p-value of the F-statistic.
        • likelihood that the F-value calculated from the test would have occurred if the null hypothesis of no difference among group means were true.
  • In the Example
    • Total Variance = Between or MSTR + Within or MSE
    • First line (Between): \(\text{DOF}_{(k-1)} = 2, \text{SSTR} = 520, \text{MSTR} = 260\)
    • Residuals (Within) : \(\text{DOF}_{(n-k)} = 12, \text{SSE} = 340, \text{MSE} = 28.33\)
    • (35.10) \(F = \frac{\text{MSTR}}{\text{MSE}} = 9.18\)
  • Calculate \({}^U\!P_{(F)}\)
    • \({}^U\!P_{F = 9.18} = 0.0038\)
      • pf(q = 9.18, df1 = 2, df2 = 12, lower.tail = FALSE) \(\#\mathcal{R}\)
  • Compare with \({\alpha} = 0.05\)
    • \({}^U\!P_{(F)} < {\alpha} \to {H_0}\) is rejected i.e. the means are different
    • The test provides sufficient evidence to conclude that the means of the three populations are not equal.
    • Analysis of variance supports the conclusion that the population mean number of units produced per week for the three assembly methods are not equal

35.10 ANOVA Table

The sum of squares associated with the source of variation referred to as “Total” is called the total sum of squares (SST). SST divided by its degrees of freedom \({n}_T - 1\) is nothing more than the overall sample variance that would be obtained if we treated the entire set of 15 observations as one data set. Refer equation (35.11)

\[\text{SST} = \text{SSTR} + \text{SSE} = \sum_{j=1}^k{\sum_{i=1}^{{n}_j}{\left( {x}_{ij} - \bar{\bar{x}}\right)^2}} \tag{35.11}\]

The degrees of freedom associated with this total sum of squares is the sum of the degrees of freedom associated with the sum of squares due to treatments and the sum of squares due to error i.e. \({n}_T - 1 = (k - 1) + ({n}_T - k)\).

In other words, SST can be partitioned into two sums of squares: the sum of squares due to treatments and the sum of squares due to error. Note also that the degrees of freedom corresponding to SST, \({n}_T - 1\), can be partitioned into the degrees of freedom corresponding to SSTR, \(k - 1\), and the degrees of freedom corresponding to SSE, \({n}_T - k\).

The analysis of variance can be viewed as the process of partitioning the total sum of squares and the degrees of freedom into their corresponding sources: treatments and error. Dividing the sum of squares by the appropriate degrees of freedom provides the variance estimates, the F value, and the p-value used to test the hypothesis of equal population means.

The square root of MSE provides the best estimate of the population standard deviation \({\sigma}\). This estimate of \({\sigma}\) on the computer output is Pooled StDev.
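The partition in equation (35.11) and the pooled standard deviation can be confirmed numerically:

# #Partition of the total sum of squares
SSTR <- 520; SSE <- 340
SST <- SSTR + SSE
SST
## [1] 860
# #Degrees of freedom partition: n_T - 1 = (k - 1) + (n_T - k), i.e. 14 = 2 + 12
# #Pooled StDev = sqrt(MSE)
round(sqrt(SSE / 12), 3)
## [1] 5.323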

35.11 Code

str(aa)
## tibble [5 x 3] (S3: tbl_df/tbl/data.frame)
##  $ A: num [1:5] 58 64 55 66 67
##  $ B: num [1:5] 58 69 71 64 68
##  $ C: num [1:5] 48 57 59 47 49
bb <- aa %>% pivot_longer(everything(), names_to = "key", values_to = "value")
# 
# #ANOVA
ii_aov <- aov(formula = value ~ key, data = bb)
#names(ii_aov)
#ii_aov
#
# #
model.tables(ii_aov, type = "means")
## Tables of means
## Grand mean
##    
## 60 
## 
##  key 
## key
##  A  B  C 
## 62 66 52
#
# #Summary
summary(ii_aov)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## key          2    520  260.00   9.176 0.00382 **
## Residuals   12    340   28.33                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

35.12 Summary

Analysis of variance (ANOVA) can be used to test for differences among means of several populations or treatments.

The completely randomized design and the randomized block design are used to draw conclusions about differences in the means of a single factor. The primary purpose of blocking in the randomized block design is to remove extraneous sources of variation from the error term. Such blocking provides a better estimate of the true error variance and a better test to determine whether the population or treatment means of the factor differ significantly.

The basis for the statistical tests used in analysis of variance and experimental design is the development of two independent estimates of the population variance \({\sigma}^2\). In the single-factor case, one estimator is based on the variation between the treatments; this estimator provides an unbiased estimate of \({\sigma}^2\) only if the means \(\{{\mu}_1, {\mu}_2, \ldots, {\mu}_k\}\) are all equal. A second estimator of \({\sigma}^2\) is based on the variation of the observations within each sample; this estimator will always provide an unbiased estimate of \({\sigma}^2\).

By computing the ratio of these two estimators (the F statistic), it is determined whether to reject the null hypothesis that the population or treatment means are equal.

In all the experimental designs considered, the partitioning of the sum of squares and degrees of freedom into their various sources enabled us to compute the appropriate values for the analysis of variance calculations and tests.

Further, Fisher LSD procedure and the Bonferroni adjustment can be used to perform pairwise comparisons to determine which means are different.

Validation


36 Simple Linear Regression

36.1 Overview

  • Larose Chapter 8 (338) : “Simple Linear Regression” has been merged here.

36.2 Simple Linear Regression Model

Definition 36.1 Regression analysis can be used to develop an equation showing how two or more variables are related.
Definition 36.2 The variable being predicted is called the dependent variable \(({y})\).
Definition 36.3 The variable or variables being used to predict the value of the dependent variable are called the independent variables \(({x})\).
Definition 36.4 The simplest type of regression analysis involving one independent variable and one dependent variable in which the relationship between the variables is approximated by a straight line, is called simple linear regression.
Definition 36.5 The equation that describes how \({y}\) is related to \({x}\) and an error term \(\epsilon\) is called the regression model. For example, simple linear regression model is given by equation \({y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon}\)

\[{y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon} \tag{36.1}\]

Note

  • \({\beta}_0\) and \({\beta}_1\) are referred to as the parameters of the model
Definition 36.6 The random variable, error term \(({\epsilon})\), accounts for the variability in \({y}\) that cannot be explained by the linear relationship between \({x}\) and \({y}\).
Definition 36.7 The equation that describes how the mean or expected value of \({y}\), denoted \(E(y)\), is related to \({x}\) is called the regression equation. Simple Linear Regression Equation is: \(E(y) = {\beta}_0 + {\beta}_1 {x}\). The graph of the simple linear regression equation is a straight line; \({\beta}_0\) is the y-intercept of the regression line, \({\beta}_1\) is the slope.

36.3 Least Squares Method

Definition 36.8 Sample statistics (denoted \(b_0\) and \(b_1\)) are computed as estimates of the population parameters \({\beta}_0\) and \({\beta}_1\). Thus Estimated Simple Linear Regression Equation is: \(\hat{y} = b_0 + b_1 {x}\). The value of \(\hat{y}\) provides both a point estimate of \(E(y)\) for a given value of ‘x’ and a prediction of an individual value of ‘y’ for a given value of ‘x.’
Definition 36.9 The least squares method is a procedure for using sample data to find the estimated regression equation. It uses the sample data to provide the values of \(b_0\) and \(b_1\) that minimize the sum of the squares of the deviations between the observed values of the dependent variable \(y_i\) and the predicted values of the dependent variable \(\hat{y}_i\), i.e. \(\min \Sigma(y_i - \hat{y}_i)^2\), or \(\min(\text{SSE})\)
  • Scatter diagrams for regression analysis are constructed with the independent variable ‘x’ on the horizontal axis and the dependent variable ‘y’ on the vertical axis.
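As a sketch of the least squares method in R, assuming the built-in cars data (stopping distance as y, speed as x; any paired x-y data would do):

# #Least squares estimates b0 and b1 via lm()
fit <- lm(dist ~ speed, data = cars)  # y = dist, x = speed
coef(fit)
## (Intercept)       speed 
##  -17.579095    3.932409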

“ForLater” - Equation and calculation for \(b_0\) and \(b_1\)

36.4 Coefficient of Determination

Definition 36.10 The deviations of the y values about the estimated regression line are called residuals. The \(i^{\text{th}}\) residual represents the error in using (predicted) \(\hat{y}_i\) to estimate (observed) \(y_i\).
  • For the \(i^{\text{th}}\) observation, the difference between the observed value of the dependent variable, \(y_i\), and the predicted value of the dependent variable, \(\hat{y}_i\), is called the \(i^{\text{th}}\) residual.
Definition 36.11 The sum of squares of residuals or errors is the quantity that is minimized by the least squares method. This quantity, also known as the sum of squares due to error, is denoted by SSE. i.e. \(\text{SSE} = \Sigma(y_i - \hat{y}_i)^2\)
  • The value of SSE is a measure of the error in using the estimated regression equation to predict the values of the dependent variable in the sample.
Definition 36.12 To measure how much the \(\hat{y}\) values on the estimated regression line deviate from \(\overline{y}\), another sum of squares is computed. This sum of squares, called the sum of squares due to regression, is denoted SSR. i.e. \(\text{SSR} = \Sigma(\hat{y}_i - \overline{y})^2\)
Definition 36.13 For the \(i^{\text{th}}\) observation in the sample, the difference \(y_i - \overline{y}\) provides a measure of the error involved in using \(\overline{y}\) for prediction. The corresponding sum of squares, called the total sum of squares, is denoted SST. i.e. \(\text{SST} = \Sigma(y_i - \overline{y})^2 \to \text{SST} = \text{SSE} + \text{SSR}\). SST is a measure of the total variability in the values of the response variable alone, without reference to the predictor.
  • We can think of SST as a measure of how well the observations cluster about the \(\overline{y}\) line and SSE as a measure of how well the observations cluster about the \(\hat{y}\) line.
  • The estimated regression equation would provide a perfect fit if every value of the dependent variable \(y_i\) happened to lie on the estimated regression line.
    • In this case, \(y_i - \hat{y}_i\) would be zero for each observation, resulting in SSE = 0.
    • Thus for a perfect fit SSR must equal SST, and the ratio (SSR/SST) must equal one.
Definition 36.14 The ratio \(r^2 =\frac{\text{SSR}}{\text{SST}} \in [0, 1]\), is used to evaluate the goodness of fit for the estimated regression equation. This ratio is called the coefficient of determination (\(r^2\)). It can be interpreted as the proportion of the variability in the dependent variable y that is explained by the estimated regression equation.
  • \(r^2\) can be interpreted as the percentage of the total sum of squares that can be explained by using the estimated regression equation.
  • Larger values of \(r^2\) imply that the least squares line provides a better fit to the data; that is, the observations are more closely grouped about the least squares line. But, using only \(r^2\), we can draw no conclusion about whether the relationship between x and y is statistically significant.

36.5 Correlation Coefficient

25.26 Correlation coefficient is a measure of linear association between two variables that takes on values between -1 and +1. Values near +1 indicate a strong positive linear relationship; values near -1 indicate a strong negative linear relationship; and values near zero indicate the lack of a linear relationship.

  • If a regression analysis has already been performed and the coefficient of determination \(r^2\) computed, the sample correlation coefficient \(r_{xy} = (\text{sign of } b_1)\sqrt{r^2}\)
  • In the case of a linear relationship between two variables, both the coefficient of determination \((r^2)\) and the sample correlation coefficient \((r_{xy})\) provide measures of the strength of the relationship.
    • \((r^2) \in [0, 1]\) : The coefficient of determination provides a measure between zero and one
    • \((r_{xy}) \in [-1, 1]\) : The sample correlation coefficient provides a measure between −1 and +1.
    • Although \(r_{xy}\) is restricted to a linear relationship between two variables, \(r^2\) can be used for nonlinear relationships and for relationships that have two or more independent variables.
    • Thus, the coefficient of determination provides a wider range of applicability.
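Both measures can be obtained from the same fit; the cars data is assumed here only for illustration:

# #Coefficient of determination and sample correlation coefficient
fit <- lm(dist ~ speed, data = cars)
r2 <- summary(fit)$r.squared
r_xy <- sign(coef(fit)["speed"]) * sqrt(r2)  # (sign of b1) * sqrt(r^2)
round(c(r2 = r2, r_xy = r_xy), 4)
##     r2   r_xy 
## 0.6511 0.8069
cor(cars$speed, cars$dist)                   # same as r_xy
## [1] 0.8068949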

36.6 Model Assumptions

  • Value of the coefficient of determination \((r^2)\) is a measure of the goodness of fit of the estimated regression equation. However, even with a large value of \(r^2\), the estimated regression equation should not be used until further analysis of the appropriateness of the assumed model has been conducted.
  • An important step in determining whether the assumed model is appropriate involves testing for the significance of the relationship.
  • The tests of significance in regression analysis are based on the following assumptions about the error term \(\epsilon\).
Definition 36.15 Regression Assumption 1/4 (Zero-Mean): For Regression Model \({y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon}\) : The error term \(\epsilon\) is a random variable with a mean or expected value of zero; \(E(\epsilon) = 0\). (Implication) \(\beta_0\) and \(\beta_1\) are constants, therefore \(E(\beta_0) = \beta_0\) and \(E(\beta_1) = \beta_1\); thus, for a given value of x, the expected value of y is given by Regression equation \(E(y) = {\beta}_0 + {\beta}_1 {x}\)
Definition 36.16 Regression Assumption 2/4 (Constant Variance): For Regression Model \({y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon}\) and Regression equation \(E(y) = {\beta}_0 + {\beta}_1 {x}\) : The variance of \(\epsilon\), denoted by \({\sigma}^2\), is the same for all values of x. (Implication) The variance of y about the regression line equals \({\sigma}^2\) and is the same for all values of x.
Definition 36.17 Regression Assumption 3/4 (Independence): For Regression Model \({y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon}\) and Regression equation \(E(y) = {\beta}_0 + {\beta}_1 {x}\) : The values of \(\epsilon\) are independent. (Implication) The value of \(\epsilon\) for a particular value of x is not related to the value of \(\epsilon\) for any other value of x; thus, the value of y for a particular value of x is not related to the value of y for any other value of x.
Definition 36.18 Regression Assumption 4/4 (Normality): For Regression Model \({y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon}\) and Regression equation \(E(y) = {\beta}_0 + {\beta}_1 {x}\) : The error term \(\epsilon\) is a normally distributed random variable for all values of x. (Implication) Because y is a linear function of \(\epsilon\), y is also a normally distributed random variable for all values of x.
Definition 36.19 Four Regression Assumptions: (1) Zero-Mean: \(E(\epsilon) = 0\). (2) Constant Variance: The variance of \(\epsilon\) (\({\sigma}^2\)) is the same for all values of x. (3) Independence: The values of \(\epsilon\) are independent. (4) Normality: The error term \(\epsilon\) has a normal distribution.
  • Caution: We are also making an assumption or hypothesis about the form of the relationship between x and y. That is, we assume that a straight line represented by \({\beta}_0 + {\beta}_1 {x}\) is the basis for the relationship between the variables. We must not lose sight of the fact that some other model, for instance \({y} = {\beta}_0 + {\beta}_1 {x}^2 + {\epsilon}\), may turn out to be a better model for the underlying relationship.

36.7 Testing for Significance

36.15 Regression Assumption 1/4 (Zero-Mean): For Regression Model \({y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon}\) : The error term \(\epsilon\) is a random variable with a mean or expected value of zero; \(E(\epsilon) = 0\). (Implication) \(\beta_0\) and \(\beta_1\) are constants, therefore \(E(\beta_0) = \beta_0\) and \(E(\beta_1) = \beta_1\); thus, for a given value of x, the expected value of y is given by Regression equation \(E(y) = {\beta}_0 + {\beta}_1 {x}\)

  • If \({\beta}_1 = 0 \to E(y) = {\beta}_0\) : In this case, the mean value of y does not depend on the value of x and hence we would conclude that x and y are not linearly related.
  • Alternatively, if \({\beta}_1 \neq 0\), we would conclude that the two variables are related.
  • Thus, to test for a significant regression relationship, we must conduct a hypothesis test to determine whether the value of \({\beta}_1\) is zero. i.e. \({H_0} : {\beta}_1 = 0\)
  • Two tests are commonly used. Both require an estimate of \({\sigma}^2\), the variance of \(\epsilon\) in the regression model.

36.16 Regression Assumption 2/4 (Constant Variance): For Regression Model \({y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon}\) and Regression equation \(E(y) = {\beta}_0 + {\beta}_1 {x}\) : The variance of \(\epsilon\), denoted by \({\sigma}^2\), is the same for all values of x. (Implication) The variance of y about the regression line equals \({\sigma}^2\) and is the same for all values of x.

  • Estimate of \({\sigma}^2\)
    • \({\sigma}^2\), the variance of \(\epsilon\), also represents the variance of the y values about the regression line.

36.10 The deviations of the y values about the estimated regression line are called residuals. The \(i^{\text{th}}\) residual represents the error in using (predicted) \(\hat{y}_i\) to estimate (observed) \(y_i\).

  • Thus, SSE, the sum of squared residuals, is a measure of the variability of the actual observations about the estimated regression line.
    • SSE has \((n − 2)\) degrees of freedom because two parameters (\({\beta}_0\) and \({\beta}_1\)) must be estimated to compute SSE.
Definition 36.20 Mean squared error (MSE) is a measure of the accuracy of model estimates for a continuous target variable. It provides the estimate of \({\sigma}^2\) and is given by SSE divided by its degrees of freedom \((n - 2)\), i.e. \(s^2 = \text{MSE} = \frac{\text{SSE}}{n - 2}\), where \(s\) is the standard error of the estimate. Lower MSE is preferred.
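In R, MSE and the standard error of the estimate \(s\) (reported by summary() as the “Residual standard error”) follow directly from the residuals; the cars fit is again an illustrative assumption:

# #MSE and standard error of the estimate
fit <- lm(dist ~ speed, data = cars)
SSE <- sum(resid(fit)^2)
n <- nrow(cars)
MSE <- SSE / (n - 2)
c(MSE = MSE, s = sqrt(MSE))
all.equal(sqrt(MSE), summary(fit)$sigma)
## [1] TRUE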

36.8 Inference in Regression

36.8.1 t-Test

If we used a different random sample for the same regression study, the resulting regression equation would obviously differ from the earlier one. Indeed, \(b_0\) and \(b_1\), the least squares estimators, are sample statistics with their own sampling distributions.

Definition 36.21 Standard deviation of \(b_1\) is \({\sigma}_{b_1}\). Its estimate, estimated standard deviation of \(b_1\), is given by \(s_{b_1} = \frac{s}{\sqrt{\Sigma (x_i - {\overline{x}})^2}}\). The standard deviation of \(b_1\) is also referred to as the standard error of \(b_1\). Thus, \(s_{b_1}\) provides an estimate of the standard error of \(b_1\).
  • The t test for a significant relationship is based on the fact that the test statistic \(\frac{b_1 - \beta_1}{s_{b_1}}\) follows a t distribution with \((n − 2)\) degrees of freedom. If the null hypothesis is true, then \(\beta_1 = 0\) and \(t = \frac{b_1}{s_{b_1}}\)
    • If \({}^2\!P_{(t)} \leq {\alpha} \to {H_0}\) Rejected.
    • The form of a confidence interval for \(\beta_1\) is as follows: \(b_1 \pm t_{{\alpha}/2} s_{b_1}\)
    • Large values of \(s_{b_1}\) indicate that the estimate of the slope \(b_1\) is unstable, while small values of \(s_{b_1}\) indicate that the estimate of the slope \(b_1\) is precise.
Definition 36.22 \(\text{\{Test for Significance in Simple Linear Regression\} } {H_0} : {\beta}_1 = 0 \iff {H_a}: {\beta}_1 \neq 0\)
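Definition 36.21 and the t test map directly onto the lm() summary; the cars fit is used as an illustrative assumption:

# #t test for a significant relationship (H0: beta_1 = 0)
fit <- lm(dist ~ speed, data = cars)
summary(fit)$coefficients["speed", ]  # b1, s_b1, t = b1/s_b1, p-value
confint(fit, "speed", level = 0.95)   # b1 +/- t_{alpha/2} * s_b1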

36.8.2 F-Test

An F test, based on the F probability distribution, can also be used to test for significance in regression. With only one independent variable, the F test will provide the same conclusion as the t test; that is, if the t test indicates \(b_1 \neq 0\) and hence a significant relationship, the F test will also indicate a significant relationship. But with more than one independent variable, only the F test can be used to test for an overall significant relationship.

As shown earlier, MSE provides an estimate of \({\sigma}^2\). If the null hypothesis \({H_0} : {\beta}_1 = 0\) is true, the sum of squares due to regression, SSR, divided by its degrees of freedom provides another independent estimate of \({\sigma}^2\).

Definition 36.23 The mean square due to regression (MSR) provides another estimate of \({\sigma}^2\) when the null hypothesis is true. It is given by SSR divided by its degrees of freedom: \(\text{MSR} = \frac{\text{SSR}}{\text{Regression degrees of freedom}} = \frac{\text{SSR}}{\text{Number of independent variables}}\)
  • If the null hypothesis \({H_0} : {\beta}_1 = 0\) is true, MSR and MSE are two independent estimates of \({\sigma}^2\) and the sampling distribution of MSR/MSE follows an F distribution with numerator degrees of freedom equal to one and denominator degrees of freedom equal to \((n − 2)\).
    • Therefore, when \({\beta}_1 = 0\), the value of MSR/MSE should be close to one.
      • Both MSE and MSR provide unbiased estimates of \({\sigma}^2\)
    • However, if the null hypothesis is false \({\beta}_1 \neq 0\), MSR will overestimate \({\sigma}^2\) and the value of MSR/MSE will be inflated; thus, large values of MSR/MSE lead to the rejection of \({H_0}\) and the conclusion that the relationship between x and y is statistically significant.
      • MSE still provides an unbiased estimate of \({\sigma}^2\)
    • Test Statistic \(F = \frac{\text{MSR}}{\text{MSE}}\)
    • If \(P_{(F)} \leq {\alpha} \to {H_0}\) Rejected.
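The same F test is reported by anova(); with one independent variable, the F value equals the square of the t statistic for the slope:

# #F test for overall significance
fit <- lm(dist ~ speed, data = cars)
anova(fit)  # columns: Df, Sum Sq, Mean Sq, F value, Pr(>F)
# #MSR / MSE directly from the table
a <- anova(fit)
a["speed", "Mean Sq"] / a["Residuals", "Mean Sq"]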

36.22 \(\text{\{Test for Significance in Simple Linear Regression\} } {H_0} : {\beta}_1 = 0 \iff {H_a}: {\beta}_1 \neq 0\)

36.9 Cautions About the Interpretation of Significance Tests

Regression analysis, which can be used to identify how variables are associated with one another, cannot be used as evidence of a cause-and-effect relationship.

Rejecting the null hypothesis \({H_0} : {\beta}_1 = 0\) and concluding that the relationship between x and y is significant does not enable us to conclude that a cause-and-effect relationship is present between x and y.

Concluding a cause-and-effect relationship is warranted only if the analyst can provide some type of theoretical justification that the relationship is in fact causal.

In addition, just because we are able to reject \({H_0} : {\beta}_1 = 0\) and demonstrate statistical significance does not enable us to conclude that the relationship between x and y is linear.

We can state only that x and y are related and that a linear relationship explains a significant portion of the variability in y over the range of values for x observed in the sample.

Given a significant relationship, we should feel confident in using the estimated regression equation for predictions corresponding to x values within the range of the x values observed in the sample. Unless other reasons indicate that the model is valid beyond this range, predictions outside the range of the independent variable should be made with caution.

i.e. if the sample has x values from 2 to 26, we can use the equation to estimate y for x = 20, but it should not be extrapolated to x = 40.

36.10 Using the Estimated Regression Equation for Estimation and Prediction

“ForLater” - Interval Estimation, Confidence Interval, Prediction Interval

36.11 Residual Analysis

Residual analysis is the primary tool for determining whether the assumed regression model is appropriate.

36.19 Four Regression Assumptions: (1) Zero-Mean: \(E(\epsilon) = 0\). (2) Constant Variance: The variance of \(\epsilon\) (\({\sigma}^2\)) is the same for all values of x. (3) Independence: The values of \(\epsilon\) are independent. (4) Normality: The error term \(\epsilon\) has a normal distribution.

  • Common Plots
    • Residual Plot Against x (Scatterplot of \(x_i\) and \(\{y_i - \hat{y}_i\}\))
    • Residual Plot Against the fits (predicted values) \(\hat{y}\) (Scatterplot of \(\hat{y}_i\) and \(\{y_i - \hat{y}_i\}\))
      • It is more widely used in multiple regression analysis, because of the presence of more than one independent variable.
      • There should be no deviant pattern observed for model to remain valid
        • No Curvature - violates the independence assumption
        • No Funnel - violates the constant variance assumption
        • No directional change - violates the zero-mean assumption
      • “Rorschach effect” : Do not see pattern in randomness
        • The null hypothesis when examining these plots is that the assumptions are intact; only systematic and clearly identifiable patterns in the residuals plots offer evidence to the contrary.
    • Standardized Residuals (Scaled) Plot Against x
    • Normal Probability Plot (QQ plot) - Standardised Residuals vs. Normal Scores
Definition 36.24 A normal probability plot is a quantile-quantile plot of the quantiles of a particular distribution against the quantiles of the standard normal distribution, for the purposes of determining whether the specified distribution deviates from normality.
  • Normal Probability Plot (QQ plot)
    • In a normality plot, the observed values of the distribution of interest are compared against the same number of values that would be expected from the normal distribution.
    • If the distribution is normal, then the bulk of the points in the plot should fall on a straight line; systematic deviations from linearity in this plot indicate non-normality.
  • Tests
    • Anderson-Darling Test for Normality
      • The null hypothesis is that the normal distribution fits, so that small p-values will indicate lack of fit.
    • For assessing whether the constant variance assumption has been violated, either Bartlett test or Levene test may be used.
    • For determining whether the independence assumption has been violated, either the Durbin-Watson test or the runs test may be applied.
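The plots and tests listed above can be produced in R; nortest and car appear in the session info, so their test functions are assumed available (the cars fit is illustrative):

# #Residual analysis for an lm fit
fit <- lm(dist ~ speed, data = cars)
plot(fitted(fit), resid(fit))                   # residuals vs fits: look for curvature or funnel
qqnorm(rstandard(fit)); qqline(rstandard(fit))  # normal probability plot of standardised residuals
nortest::ad.test(resid(fit))                    # Anderson-Darling test for normality
car::durbinWatsonTest(fit)                      # Durbin-Watson test for independence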

36.12 Outliers

25.24 Outliers are data points or observations that do not fit the trend shown by the remaining data. They differ significantly from the other observations. Unusually large or small values are commonly found to be outliers.

  • Refer Outliers: C03
    • An outlier is an observation that has a very large standardized residual (scaled) in absolute value.
Definition 36.25 High leverage points are observations with extreme values for the independent variables. The leverage of an observation is determined by how far the values of the independent variables are from their mean values.
  • A high leverage point is an observation that is extreme in the predictor space.
    • For leverage, only x is considered, not y.
    • Example: we have Distance Travelled (y) vs. Time Taken (x) information for 10 people with max(x) of 9 hours. If another person travels 39 km in 16 hours, that observation automatically becomes a high leverage point, solely based on the 16 hours (x).
      • If the point lies on the regression line, i.e. its standardised residual is low, then it is not an outlier. The decision to designate it as an outlier considers the distance travelled (y).
Definition 36.26 Influential observations are those observations which have a strong influence or effect on the regression results. Influential observations can be identified from a scatter diagram when only one independent variable is present.
  • An observation is influential if the regression parameters alter significantly based on the presence or absence of the observation in the data set.
    • An outlier may or may not be influential. Similarly, a high leverage point may or may not be influential.
    • Usually, influential observations combine both the characteristics of large residual and high leverage
    • It is possible for an observation to be not-quite flagged as an outlier, and not-quite flagged as a high leverage point, but still be influential through the combination of the two characteristics.
      • Influential observations that are caused by an interaction of large residuals and high leverage can be difficult to detect. One such diagnostic procedure is Cook's D statistic.
    • Example: Suppose another person travels 20 km in 5 hours when mean(x) is 5 hours, i.e. the observation is situated exactly at the mean of the independent variable.
      • Although this would be an outlier because it has a large standardised residual,
      • It is not influential because it has very low leverage (placed exactly at the mean of x).
        • Its presence or absence is going to change the parameters of the regression equation only by a small amount.
Definition 36.27 Cook distance (\(D_i\)) is the most common measure of the influence of an observation. It takes into account both the size of the residual and the amount of leverage for that observation. Generally, an observation is influential if \(D_i > 1\).
  • \(D_i\) can be compared against the percentiles of the F-distribution with (m, n − m − 1) degrees of freedom.
    • ‘n’ is the total number of observations
    • ‘m’ indicates the number of predictor variables
    • If the observed value lies within the first quartile of this distribution (lower than the \(25^{\text{th}}\) percentile), then the observation has little influence on the regression;
    • However, if \(D_i\) is greater than the median of this distribution, then the observation is influential.
    • The hiker in the earlier example (39 km in 16 hours): was that observation influential?
      • This observation has high leverage; however, it is not an outlier because it lies near the regression line
      • It has low \(D_i\) and thus the observation is not influential
    • What if the hiker travelled 23 km in 10 hours?
      • It has neither high leverage nor a large residual
      • However, it is influential because \(D_i\) is beyond the \(50^{\text{th}}\) percentile
      • The influence of this observation stems from the combination of its moderately large residual with its moderately large leverage.
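The three diagnostics above can be computed in base R. The hiker data below are invented purely to mirror the running example (nine observations plus the 16-hour hiker); only the function calls are the point.

```r
# Sketch: outlier, leverage, and influence diagnostics for an lm() fit.
# The data frame is a made-up stand-in mirroring the hiker example.
hike <- data.frame(time = c(2, 3, 4, 5, 6, 7, 8, 9, 16),
                   dist = c(5, 7, 10, 12, 15, 17, 20, 22, 39))
fit <- lm(dist ~ time, data = hike)

rstandard(fit)       # standardized residuals; large |values| flag outliers
hatvalues(fit)       # leverage; extreme x-values get large hat values
cooks.distance(fit)  # Cook's D; compare against qf(0.50, m, n - m - 1)
```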

36.13 Transformations to achieve Linearity

  • Ladder of Re-expressions - “ForLater”
    • The ladder of re-expressions consists of the following ordered set of transformations for any continuous variable t: \(t^{-3}, t^{-2}, t^{-1}, t^{-1/2}, \ln(t), \sqrt{t}, t^1, t^2, t^3\)
  • Box-Cox Transformations
    • This method involves first choosing a set of candidate values for \(\lambda\), and finding SSE for regressions performed using each value of \(\lambda\). Then, plotting \(\text{SSE}_\lambda\) versus \(\lambda\), find the lowest point of a curve through the points in the plot. This represents the maximum-likelihood estimate of \(\lambda\).
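A minimal sketch of the Box-Cox procedure, using `MASS::boxcox()` on a stand-in model (the `mtcars` dataset is an assumption here, not a dataset from these notes). `boxcox()` profiles the log-likelihood over a grid of \(\lambda\) values, which is equivalent to finding the \(\lambda\) minimizing \(\text{SSE}_\lambda\); the peak of the curve is the maximum-likelihood estimate.

```r
# Sketch: Box-Cox transformation search (mtcars as stand-in data)
library(MASS)
fit <- lm(mpg ~ wt, data = mtcars)
bc  <- boxcox(fit, lambda = seq(-2, 2, 0.1))  # plots log-likelihood vs lambda
bc$x[which.max(bc$y)]                          # maximum-likelihood estimate of lambda
```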

Validation


37 Multiple Regression

37.1 Overview

  • Larose Chapter 9 (339) : “Multiple Regression and Model Building” has been merged here.
Definition 37.1 Multiple regression analysis is the study of how a dependent variable \(y\) is related to two or more independent variables. Multiple Regression Model is \({y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon\)
Definition 37.2 The equation that describes how the mean or expected value of \({y}\), denoted \(E(y)\), is related to \({x}\) is called the regression equation. Multiple Regression Linear Equation is: \(E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p\).
  • Note: Regression Model \({y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon\) contains error term \(\epsilon\), whereas the Regression Equation \(E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p\) does not have that.
    • \(E(y)\) represents the average of all possible values of y that might occur for the given values of \(\{x_1, x_2, \ldots, x_p\}\).
  • Model parameters \(\{\beta_0, \beta_1, \beta_2, \ldots, \beta_p\}\) are generally unknown and thus are estimated by sample statistics \(\{b_0, b_1, b_2, \ldots, b_p\}\).
Definition 37.3 Estimated Multiple Regression Equation is given by \(\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p\). Where \(b_i\) represents an estimate of the change in y corresponding to a one-unit change in \(x_i\) when all other independent variables are held constant.

37.2 SST, SSR, SSE and MST, MSR, MSE

  • Relationship between SST, SSR, SSE
    • \(\text{SST} = \text{SSR} + \text{SSE}\)
    • Total Sum of Squares \(\text{SST} = \sum(y_i - \overline{y})^2\)
      • Degrees of Freedom = \((n - 1)\)
      • Mean Sum of Squares \(\text{MST} = \frac{\text{SST}}{(n - 1)}\)
    • Sum of Squares due to Regression \(\text{SSR} = \sum(\hat{y}_i - \overline{y})^2\)
      • Degrees of Freedom = \((p)\)
      • Mean square due to regression \(\text{MSR} = \frac{\text{SSR}}{(p)}\)
    • Sum of Squares due to Error \(\text{SSE} = \sum(y_i - \hat{y}_i)^2\)
      • Degrees of Freedom = \((n - p - 1)\)
      • Mean square due to error \(\text{MSE} = \frac{\text{SSE}}{(n - p - 1)}\)
    • Coefficient of Determination
      • Simple \(r^2 = \frac{\text{SSR}}{\text{SST}}\)
      • Multiple \(R^2 = \frac{\text{SSR}}{\text{SST}}\)
      • In general, \(R^2\) always increases as independent variables are added to the model.
    • F-statistic
      • \(F = \frac{\text{MSR}}{\text{MSE}}\)
      • Note that for \(r^2\) the denominator was Total i.e. SST whereas for F-statistic denominator is Error (MSE)
Definition 37.4 If a variable is added to the model, \(R^2\) becomes larger even if the variable added is not statistically significant. The adjusted multiple coefficient of determination \((R_a^2)\) compensates for the number of independent variables in the model. With ‘n’ denoting the number of observations and ‘p’ denoting the number of independent variables: \(R_a^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}\)
  • Note: If the value of \(R^2\) is small and the model contains a large number of independent variables, the adjusted coefficient of determination \((R_a^2)\) can take a negative value.
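These two behaviours, \(R^2\) never decreasing and \(R_a^2\) penalizing extra predictors, can be seen directly from `summary.lm` output. The `mtcars` variables below are only stand-ins for illustration.

```r
# Sketch: R^2 vs adjusted R^2 as a predictor is added (mtcars stand-in data)
f1 <- summary(lm(mpg ~ wt,      data = mtcars))
f2 <- summary(lm(mpg ~ wt + hp, data = mtcars))
c(f1$r.squared,     f2$r.squared)      # R^2 never decreases
c(f1$adj.r.squared, f2$adj.r.squared)  # adjusted R^2 can decrease

# Adjusted R^2 by hand, matching the formula above
n <- nrow(mtcars); p <- 2
1 - (1 - f2$r.squared) * (n - 1) / (n - p - 1)  # equals f2$adj.r.squared
```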

37.3 Model Assumptions

36.19 Four Regression Assumptions: (1) Zero-Mean: \(E(\epsilon) = 0\). (2) Constant Variance: The variance of \(\epsilon\) (\({\sigma}^2\)) is the same for all values of x. (3) Independence: The values of \(\epsilon\) are independent. (4) Normality: The error term \(\epsilon\) has a normal distribution.

  • All 4 assumptions of Simple Linear Regression apply to Multiple Linear Regression as well; only the number of independent variables in the model increases.

37.4 Testing for Significance

  • In multiple regression, the t-test and the F-test have different purposes.
    • The F test is used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables; we will refer to the F-test as the test for overall significance.
      • F-Test Statistic \(F = \frac{\text{MSR}}{\text{MSE}}\)
    • If the F test shows an overall significance, the t-test is used to determine whether each of the individual independent variables is significant.
      • A separate t-test is conducted for each of the independent variables in the model.
      • We refer to each of these t-tests as a test for individual significance.
      • t-Test Statistic \(t = \frac{b_i}{s_{b_i}}\)
Definition 37.5 \(\text{\{F-Test in Multiple Linear Regression\} } {H_0} : {\beta}_1 = {\beta}_2 = \cdots = {\beta}_p = 0 \iff {H_a}: \text{At least one parameter is not zero}\)
Definition 37.6 \(\text{\{t-Test in Multiple Linear Regression\} } {H_0} : {\beta}_i = 0 \iff {H_a}: {\beta}_i \neq 0\)
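Both tests are reported by `summary()` on a fitted model. The sketch below uses `mtcars` as stand-in data; the predictors chosen are arbitrary.

```r
# Sketch: overall F-test and individual t-tests from one summary() call
fit <- summary(lm(mpg ~ wt + hp, data = mtcars))  # mtcars as stand-in data
fit$fstatistic    # F = MSR/MSE with (p, n - p - 1) degrees of freedom
fit$coefficients  # per-predictor t = b_i / s_{b_i}, with p-values
```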

37.5 Multicollinearity

Definition 37.7 Multicollinearity refers to the correlation among the independent variables.
  • In t-tests for the significance of individual parameters, the difficulty caused by multicollinearity is that it is possible to conclude that none of the individual parameters are significantly different from zero when an F test on the overall multiple regression equation indicates a significant relationship.
    • When the independent variables are highly correlated, it is not possible to determine the separate effect of any particular independent variable on the dependent variable.
    • Multicollinearity is a potential problem if the absolute value of the sample correlation coefficient \(r_{x_1, x_2}\) exceeds 0.7 for any two of the independent variables.

37.6 Categorical Independent Variables

  • Ex: Gender (Male, Female)
    • We would need a dummy variable or ‘indicator variable’ which will be {0 = Male, 1 = Female}
    • Let \({x_1}\) denote a numerical variable and \({x_2}\) is the dummy variable which can take 2 values {M = 0, F = 1}.
    • Multiple Regression equation would be: \(E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2\)
      • Expected Value of Y given M : \(E(y | \text{M}) = \beta_0 + \beta_1 x_1\)
      • Expected Value of Y given F : \(E(y | \text{F}) = \beta_0 + \beta_1 x_1 + \beta_2\)
    • In effect, the use of a dummy variable provides two estimated regression equations that can be used to predict Y, each corresponding to either level of \({x_2}\).
      • These regression lines have the same slope but different intercepts when Y is plotted against \({x_1}\), i.e. the plot of Y vs. \({x_1}\) shows two parallel lines, one for each level of \({x_2}\).
  • Number of dummy variables
    • Above example had only 2 levels so it was modelled with a single dummy variable with 2 levels of {0, 1}.
    • A variable with 3 levels e.g. {low, medium, high} will NOT use a single dummy variable with 3 levels of {0, 1, 2}.
    • Rather, we would need 2 dummy variables each with 2 levels of {0, 1}.
Definition 37.8 A categorical variable with \(k\) levels must be modeled using \(k − 1\) dummy variables (or indicator variables). It can take only the values 0 and 1. e.g. A variable with 3 levels of {low, medium, high} would need 2 dummy variables \(\{x_1, x_2\}\) each being either 0 or 1 only. i.e. low \(\to \{x_1 = 1, x_2 = 0\}\), medium \(\to \{x_1 = 0, x_2 = 1\}\), high \(\to \{x_1 = 0, x_2 = 0\}\). Thus \(x_1\) is 1 when low and 0 otherwise, \(x_2\) is 1 when medium and 0 otherwise. High is represented as neither \(x_1\) nor \(x_2\) i.e. both are zero. Note that both cannot be 1. Only one of them can be TRUE at a time.
  • The category that is not assigned an indicator variable is denoted the reference category (or the Benchmark). In the example, “high” is the reference category.
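R performs this \(k - 1\) dummy coding automatically for factors; the first factor level becomes the reference category. A minimal sketch (the `lev` variable is invented for illustration):

```r
# Sketch: k - 1 dummy variables for a k-level factor; the first level
# ("high" here) is the reference category and gets no column of its own.
lev <- factor(c("low", "medium", "high"), levels = c("high", "medium", "low"))
model.matrix(~ lev)  # columns: (Intercept), levmedium, levlow
```

`relevel()` can be used to choose a different reference category.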

“ForLater” - Studentized Deleted Residuals and Outliers

37.7 Logistic Regression

  • Generally part of Classification
  • In many regression applications the dependent variable may only assume two discrete values.
    • For instance, a bank might want to develop an estimated regression equation for predicting whether a person will be approved for a credit card.
    • The dependent variable can be coded as y = 1 if the bank approves the request for a credit card and y = 0 if the bank rejects the request for a credit card.
    • Using logistic regression we can estimate the probability that the bank will approve the request for a credit card given a particular set of values for the chosen independent variables.
  • The odds in favor of an event occurring is defined as the probability the event will occur divided by the probability the event will not occur. In logistic regression the event of interest is always y = 1.
Definition 37.9 The odds ratio measures the impact on the odds of a one-unit increase in only one of the independent variables. The odds ratio is the odds that y = 1 given that one of the independent variables has been increased by one unit \((\text{odds}_1)\) divided by the odds that y = 1 given no change in the values for the independent variables \((\text{odds}_0)\). i.e. \(\text{Odds Ratio} = \frac{\text{odds}_1}{\text{odds}_0}\)
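In a fitted logistic regression, the odds ratio for a one-unit increase in a predictor is obtained by exponentiating its coefficient. A minimal sketch, using `mtcars$am` (a 0/1 variable) as a stand-in for the approve/reject target:

```r
# Sketch: odds ratio from a logistic regression via exp(coef)
fit <- glm(am ~ wt, data = mtcars, family = binomial)
exp(coef(fit))["wt"]  # odds ratio for a one-unit increase in wt
```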

“ForLater” - “logit”

37.8 VIF

  • Suppose we did not check for the presence of correlation among our predictors and performed the regression anyway.
  • Is there some way that the regression results can warn us of the presence of multicollinearity?
    • We may ask for the variance inflation factors (VIF) to be reported
    • \(R_i^2 \geq 0.80 \to \text{VIF} \geq 5\), considered an indicator of moderate multicollinearity
    • \(R_i^2 \geq 0.90 \to \text{VIF} \geq 10\), considered an indicator of severe multicollinearity
Definition 37.10 The variance inflation factor (VIF) is given by \(\text{VIF}_i = \frac{1}{1 - R_i^2} \in [1, \infty)\), where \(R_i^2\) is the \(R^2\) obtained by regressing \(x_i\) on the remaining predictors. The minimum value of 1 is reached when \(x_i\) is completely uncorrelated with the remaining predictors.
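The definition can be verified by hand and via `car::vif()` (the `car` package is attached per the session info). The `mtcars` predictors below are only stand-ins.

```r
# Sketch: VIF by hand and via car::vif() (mtcars as stand-in data)
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# By hand for one predictor: regress it on the remaining predictors
r2_wt <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
1 / (1 - r2_wt)  # VIF for wt

car::vif(fit)    # VIFs for all predictors at once
```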
  • Solutions for multicollinearity
    • Eliminate one of the variables
      • However, the variable might have some information relevant to the model
    • User-defined Composite
      • Scale both variables and take the mean of the values. Use this as an independent variable instead of the correlated variables
        • However, if one of the variables is an excellent predictor of the dependent variable, then averaging it with a weaker predictor is going to reduce the model performance.
        • Even if we change the weights from the average (50:50) to something else, the problem remains
    • PCA
      • Definitely Better

37.9 Variable Selection Methods

  • These include - “ForLater”
    • Forward Selection
    • Backward elimination
    • Stepwise selection
    • Best Subsets
  • The Forward Selection Procedure
    • It starts with no variables in the model.
    1. For the first variable to enter the model, select the predictor most highly correlated with the target \((x_1)\).
      • If the resulting model is not significant, then stop and report that no variables are important predictors
    2. For each remaining variable, compute the sequential F-statistic for that variable, given the variables already in the model.
      • For example, in this first pass through the algorithm, these sequential F-statistics would be \(\{F(x_2|x_1), F(x_3|x_1), F(x_4|x_1), \ldots \}\).
      • On the second pass through the algorithm, these might be \(\{F(x_3|x_1, x_2), F(x_4|x_1, x_2), \ldots \}\).
      • Select the variable with the largest sequential F-statistic.
    3. For the variable selected in step 2, test for the significance of the sequential F-statistic.
      • If the resulting model is not significant, then stop, and report the current model without adding the variable from step 2.
      • Otherwise, add the variable from step 2 into the model and return to step 2.
  • The Backward Elimination Procedure
    • It starts with all the variables in the model.
    1. Perform the regression on the full model; that is, using all available variables.
      • For example, perhaps the full model has four variables, \(\{x_1, x_2, x_3, x_4 \}\).
    2. For each variable in the current model, compute the partial F-statistic.
      • In the first pass through the algorithm, these would be \(\{F(x_1|x_2, x_3, x_4), F(x_2|x_1, x_3, x_4), F(x_3|x_1, x_2, x_4), F(x_4|x_1, x_2, x_3)\}\).
      • Select the variable with the smallest partial F-statistic. Denote this value \(F_{\text{min}}\).
    3. Test for the significance of \(F_{\text{min}}\).
    • If \(F_{\text{min}}\) is not significant, then remove the variable associated with \(F_{\text{min}}\) from the model, and return to step 2.
    • If \(F_{\text{min}}\) is significant, then stop the algorithm and report the current model.
    • If this is the first pass through the algorithm, then the current model is the full model.
    • If this is not the first pass, then the current model has been reduced by one or more variables from the full model.

37.10 Stepwise Regression

Definition 37.11 In stepwise regression, the regression model begins with no predictors, then the most significant predictor is entered into the model, followed by the next most significant predictor. At each stage, each predictor is tested whether it is still significant. The procedure continues until all significant predictors have been entered into the model, and no further predictors have been dropped. The resulting model is usually a good regression model, although it is not guaranteed to be the global optimum.
  • The stepwise procedure represents a modification of the forward selection procedure.
    • A variable that has been entered into the model early in the forward selection process may turn out to be nonsignificant, once other variables have been entered into the model.
    • The stepwise procedure checks on this possibility, by performing at each step a partial F-test, using the partial sum of squares, for each variable currently in the model.
    • If there is a variable in the model that is no longer significant, then the variable with the smallest partial F-statistic is removed from the model.
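Base R's `step()` gives a quick sketch of this add/drop logic, although it ranks candidate variables by AIC rather than by partial F-tests. The `mtcars` predictors below are arbitrary stand-ins.

```r
# Sketch: stepwise selection with step(); direction = "both" allows a
# variable entered earlier to be dropped later, as described above.
full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)
step(null, scope = formula(full), direction = "both", trace = 0)
```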

Validation


38 Regression Models

38.1 Overview

    • “ForLater” - Everything

38.2 Summary

Validation


39 Time Series

39.1 Overview

  • “Time Series Analysis and Forecasting”
    • “ForLater” - Everything

39.2 Summary

Validation


40 Nonparametric Methods

40.1 Overview

40.2 Parametric Methods

Definition 40.1 Parametric methods are the statistical methods that begin with an assumption about the probability distribution of the population which is often that the population has a normal distribution. A sampling distribution for the test statistic can then be derived and used to make an inference about one or more parameters of the population such as the population mean \({\mu}\) or the population standard deviation \({\sigma}\).

Parametric methods mostly require quantitative data. They are also sometimes more powerful than nonparametric methods.

  • The reason that parametric tests are sometimes more powerful than randomisation and tests based on ranks is that the parametric tests make use of some extra information about the data: the nature of the distribution from which the data are assumed to have come.
  • Powerful here means, they require smaller sample size.
  • However, their power advantage is not universal.
  • Further, rarely if ever do a parametric test and a nonparametric test actually have the same null hypothesis.
    • The parametric t-test is testing the mean of the distribution, assuming the first two moments exist.
    • The Wilcoxon rank sum test does not assume any moments, and tests equality of distributions instead.
    • The two tests are testing different hypotheses (comparable in a limited sense but different).
  • At large sample sizes, either of the parametric or the nonparametric tests work adequately.

40.3 Nonparametric Methods

Definition 40.2 Distribution-free methods are the Statistical methods that make no assumption about the probability distribution of the population.
Definition 40.3 Nonparametric methods are the statistical methods that require no assumption about the form of the probability distribution of the population and are often referred to as distribution free methods. Several of the methods can be applied with categorical as well as quantitative data.

Most of the statistical methods referred to as parametric methods require quantitative data, while nonparametric methods allow inferences based on either categorical or quantitative data.

  • However, the computations used in the nonparametric methods are generally done with categorical data.
    • Nominal or ordinal measures in many cases require a nonparametric test.
  • Whenever the data are quantitative, we will transform the data into categorical data in order to conduct the nonparametric test.
  • Most nonparametric tests use some way of ranking the measurements.
  • Nonparametric tests are used in cases where parametric tests are not appropriate.
    • Nonparametric tests are often necessary, especially when the distribution is not normal (skewness), the distribution is not known, or the sample size is too small (<30) to assume a normal distribution.
    • Also, if there are extreme values or values that are clearly “out of range” nonparametric tests should be used.
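The t-test vs. Wilcoxon contrast discussed above can be sketched on small skewed samples. The exponential data below are invented for illustration only.

```r
# Sketch: parametric t-test vs nonparametric Wilcoxon rank-sum test on
# skewed, small (n < 30) stand-in samples, where rank-based tests fit better.
set.seed(1)
a <- rexp(20, rate = 1)    # skewed sample
b <- rexp(20, rate = 0.5)  # skewed sample with a different scale
t.test(a, b)               # compares means; assumes approximate normality
wilcox.test(a, b)          # rank-based; no distributional assumption
```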

40.4 Summary

Validation


Quality Control


Index Numbers


41 Introduction to Data Mining

Definitions and Exercises are from the Book (Daniel T. Larose 2015)

41.1 Overview

  • “An Introduction to Data Mining and Predictive Analytics”
Definition 41.1 Data mining is the process of discovering useful patterns and trends in large data sets.
Definition 41.2 Predictive analytics is the process of extracting information from large data sets in order to make predictions and estimates about future outcomes.
  • The Cross-Industry Standard Process for Data Mining (CRISP-DM) (Iterative)
    • Business/Research Understanding Phase
      • Clearly enunciate the project objectives and requirements in terms of the business or research unit as a whole.
      • Translate these goals and restrictions into the formulation of a data mining problem definition.
      • Prepare a preliminary strategy for achieving these objectives.
    • Data Understanding Phase
      • Collect the data.
      • Use Exploratory Data Analysis (EDA) to familiarize yourself with the data, and discover initial insights.
      • Evaluate the quality of the data.
      • If desired, select interesting subsets that may contain actionable patterns.
    • Data Preparation Phase
      • Raw | Select | Filter | Subset | Clean | Transform
    • Modeling Phase
      • Select and apply appropriate modeling techniques
      • Calibrate
    • Evaluation Phase
      • Models must be evaluated for quality and effectiveness.
      • Also, determine whether the model in fact achieves the objectives set for it
      • Establish whether some important facet of the business or research problem has not been sufficiently accounted for.
    • Deployment Phase / Report / Publish

41.2 Data Mining Methods

  • Data Mining Methods and Definitions
    • Data mining methods may be categorized as either supervised or unsupervised.
    • Most data mining methods are supervised methods.
    • Unsupervised : Clustering, PCA, Factor Analysis, Association Rules, RFM
    • Supervised :
      • Regression (Continuous Target) : Linear Regression, Regularised Regression, Decision trees, Ensemble learning
        • Linear Regression : Ridge, Lasso and Elastic Regression
        • Ensemble learning : Bagging, Boosting (AdaBoost, XGBoost), Random forests
      • Classification (Categorical Target) : Decision trees, Ensemble learning, Logistic Regression, k-nearest neighbor (k-NN), Naive-Bayes
      • Deep Learning : Neural Networks
Definition 41.3 Description of patterns and trends often suggests possible explanations for the existence of those patterns within the data.
Definition 41.4 In estimation, we approximate the value of a numeric target variable using a set of numeric and/or categorical predictor variables. Methods: Point Estimation, Confidence Interval Estimation, Simple Linear Regression, Correlation, Multiple Regression etc.
Definition 41.5 Prediction is similar to classification and estimation, except that for prediction, the results lie in the future. Estimation methods are also used for Prediction. Additional Methods: k-nearest neighbor methods, decision trees, neural networks etc.
Definition 41.6 Classification is similar to estimation, however, instead of approximating the value of a numeric target variable, the target variable is categorical.

46.1 Clustering refers to the grouping of records, observations, or cases into classes of similar objects. Clustering differs from classification in that there is no target variable for clustering.

46.2 A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters.

48.1 Affinity analysis, (or Association Rules or Market Basket Analysis), is the study of attributes or characteristics that “go together.” It seeks to uncover rules for quantifying the relationship between two or more attributes. Association rules take the form "If antecedent, then consequent", along with a measure of the support and confidence associated with the rule.

Validation


42 Data Processing

42.1 Data

Refer Lecture: Data Pre-Processing Refer Numerical Measures

Please import the "B18-Churn.xlsx" Please import the "B16-Cars2.csv"

  • Caution: The Cars2 dataset has 263 observations, whereas the Cars dataset has 261.

42.2 Flag Variables

Definition 42.1 A flag variable (or dummy variable, or indicator variable) is a categorical variable taking only two values, 0 and 1. Ex: Gender (Male, Female) can be recoded into dummy Gender (Male = 0, Female = 1).
  • When a categorical predictor takes \(k \geq 3\) possible values, then define \(k - 1\) dummy variables, and use the unassigned category as the reference category.
    • Ex: Region = North, East, South, West i.e. k = 4
      • 3 flags : flag_north, flag_east, flag_south; each will be 1 for their own region; 0 otherwise
      • The flag variable for the west is not needed, as ‘region = west’ is already uniquely identified by zero values for each of the three existing flag variables.
      • Instead, the unassigned category becomes the reference category, meaning that, the interpretation of the value of north_flag is ‘region = north’ compared to ‘region = west.’
        • For example, if we are running a regression analysis with income as the target variable, and the regression coefficient for north_flag equals 1000, then the estimated income for ‘region=north’ is 1000 greater than for ‘region=west,’ when all other predictors are held constant.
      • Further, inclusion of the fourth flag variable will cause some algorithms to fail, because of the singularity of the \((X^{T}X)^{-1}\) matrix in regression, for instance.
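The Region example can be sketched with `fastDummies` (attached per the session info); `remove_first_dummy = TRUE` drops one level to serve as the reference category. The small data frame is invented for illustration.

```r
# Sketch: k - 1 flag variables for a k = 4 level categorical predictor;
# the dropped level becomes the reference category.
df <- data.frame(region = c("north", "east", "south", "west"))
fastDummies::dummy_cols(df, select_columns = "region",
                        remove_first_dummy = TRUE)
```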

42.3 Transforming Categorical to Numerical

  • Ex: Region = North, East, South, West can be assigned numbers 1 to 4
    • It may result in the algorithm treating them as continuous and/or ordered values
  • Generally, categorical variables should not be transformed into numerical ones, except when they are clearly ordered.
    • e.g. Survey Response is an ordered categorical variable and can be assigned values 1 to 5

42.4 Binning Continuous to Categorical

  • Ex: Income into low, medium, high
    • Equal width binning divides the numerical predictor into k categories of equal width
      • NOT recommended for most applications because it can be greatly affected by the outliers
    • Equal frequency binning divides the numerical predictor into k categories, each having n/k records, where n is the total number of records.
      • Simple. However, it has the problem that the same value can sometimes be found in two consecutive groups
      • Equal data values should belong to the same category.
    • Binning by clustering uses a clustering algorithm, such as k-means clustering to automatically calculate the “optimal” partitioning.
    • Binning based on predictive value: Above methods ignore the target variable; binning based on predictive value partitions the numerical predictor based on the effect each partition has on the value of the target variable.
ERROR 42.1 Error: Insufficient data values to produce ... bins.
  • summary(cut_number(diamonds$depth, n = 27)) Passed
  • summary(cut_number(diamonds$depth, n = 28)) Failed (N = 53940)
  • ggplot2::cut_number() has some internal logic about the size of the bins: it is not just the total size, the relative sizes matter too.
  • It also fails when bins would overlap.
  • Use dplyr::ntile()
  • OR Pick a bin size that works for your data.
# #n = 12
bb <- c(1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44)
#
# #Fixing Number of Bins: Unequal number of Observations and also Bins with 0 Observations
summary(cut(bb, breaks = 3))
## (0.957,15.3]  (15.3,29.7]    (29.7,44] 
##           11            0            1
summary(cut_interval(bb, n = 3))
##    [1,15.3] (15.3,29.7]   (29.7,44] 
##          11           0           1
#
# #For reference, NOT equivalent to above. 
summary(cut_width(bb, width = 15))
##  [-7.5,7.5]  (7.5,22.5] (22.5,37.5] (37.5,52.5] 
##           7           4           0           1
#
# #Using Equal Frequency: Same Observation may belong to different consecutive Bins
if(FALSE) ggplot2::cut_number(bb, n = 3) #ERROR
if(FALSE) { #Works
  #ceiling(seq_along(bb)/4)[rank(bb, ties.method = "first")] 
  tibble(bb = bb, RANK = rank(bb, ties.method = "first"), 
         ALONG = seq_along(bb), CEIL = ceiling(seq_along(bb)/4)) 
}
#
# #dplyr::ntile() can be used in place of ggplot2::cut_number()
dplyr::ntile(bb, n = 3)
##  [1] 1 1 1 1 2 2 2 2 3 3 3 3
#

42.5 Adding an Index Field (ID)

Caution: ID fields should be filtered out from the data mining algorithms, but should not be removed from the data. They exist for easy identification of records, not for use in the analysis.

mtcars %>% mutate(ID = row_number()) %>% relocate(ID) %>% slice(1:6L)
##                   ID  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4          1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag      2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710         3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive     4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout  5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant            6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

42.6 Variables that should not be removed (probably)

  • Ex: An example of correlated variables may be precipitation and attendance at a state beach.
    • As precipitation increases, attendance at the beach tends to decrease, so that the variables are negatively correlated.
  • Inclusion of correlated variables may at best double-count a particular aspect of the analysis, and at worst lead to instability of the model results.
  • Thus, we may be tempted to simply remove one of the variables.
  • However, this should not be done, as important information may thereby be discarded.
  • Instead, it is suggested that PCA be applied, where the common variability in correlated predictors may be translated into a set of uncorrelated principal components.
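The PCA route can be sketched with base R's `prcomp()`. The `mtcars` columns below are stand-ins for a set of correlated predictors.

```r
# Sketch: replacing correlated predictors with uncorrelated principal
# components (mtcars columns as stand-in correlated predictors).
pc <- prcomp(mtcars[, c("disp", "hp", "wt")], scale. = TRUE)
summary(pc)          # proportion of variance captured by each component
round(cor(pc$x), 3)  # the component scores are uncorrelated by construction
```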

Validation


43 EDA

43.1 Overview

43.2 Churn

Please import the "B18-Churn.xlsx"

  • Churn (attrition)
    • It is a term used to indicate a customer leaving the service of one company in favor of another company.
    • The data set contains 20 predictors worth of information about 3333 customers, along with the target variable, churn, an indication of whether that customer churned (left the company) or not.
  • Description [3333, 21]
    • State: Categorical, for the 50 states and the District of Columbia.
    • Account length: Integer-valued, how long account has been active.
    • Area code: Categorical
    • Phone number: Essentially a surrogate for customer ID.
    • International plan: Dichotomous categorical, yes or no.
    • Voice mail plan: Dichotomous categorical, yes or no.
    • Number of voice mail messages: Integer-valued.
    • Total day minutes: Continuous, minutes customer used service during the day.
    • Total day calls: Integer-valued.
    • Total day charge: Continuous, perhaps based on above two variables.
    • Total eve minutes: Continuous, minutes customer used service during the evening.
    • Total eve calls: Integer-valued.
    • Total eve charge: Continuous, perhaps based on above two variables.
    • Total night minutes: Continuous, minutes customer used service during the night.
    • Total night calls: Integer-valued.
    • Total night charge: Continuous, perhaps based on above two variables.
    • Total international minutes: Continuous, minutes customer used service to make international calls.
    • Total international calls: Integer-valued.
    • Total international charge: Continuous, perhaps based on above two variables.
    • Number of calls to customer service: Integer-valued.
    • Churn: Target. Indicator of whether the customer has left the company (True or False)

43.3 Basics, Skewness & Normality

Table 43.1: (C33T01) Churn: Normality
Key Min Max SD Mean Median Mode Unique isNA Skewness p_Shapiro isNormal
day_mins 0 350.8 54.467 179.78 179.4 154 1667 0 -0.0291 0.6401 TRUE
day_calls 0 165 20.069 100.44 101 102 119 0 -0.1117 0.0003 FALSE
day_charge 0 59.6 9.259 30.56 30.5 26.18 1667 0 -0.0291 0.6401 TRUE
eve_mins 0 363.7 50.714 200.98 201.4 169.9 1611 0 -0.0239 0.7125 TRUE
eve_calls 0 170 19.923 100.11 100 105 123 0 -0.0555 0.0088 FALSE
eve_charge 0 30.9 4.311 17.08 17.12 16.12 1440 0 -0.0238 0.7091 TRUE
night_mins 23.2 395 50.574 200.87 201.2 188.2 1591 0 0.0089 0.627 TRUE
night_calls 33 175 19.569 100.11 100 105 120 0 0.0325 0.2514 TRUE
night_charge 1.04 17.8 2.276 9.04 9.05 9.66 933 0 0.0089 0.6238 TRUE
intl_mins 0 20 2.792 10.24 10.3 10 162 0 -0.2449 0 FALSE
intl_calls 0 20 2.461 4.48 4 3 21 0 1.3203 0 FALSE
intl_charge 0 5.4 0.754 2.76 2.78 2.7 162 0 -0.2451 0 FALSE
custserv_calls 0 9 1.315 1.56 1 1 10 0 1.0904 0 FALSE
vmail_message 0 51 13.688 8.1 0 0 46 0 1.2637 0 FALSE
account_length 1 243 39.822 101.06 101 105 212 0 0.0965 0.0012 FALSE
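The Skewness, p_Shapiro, and isNormal columns above can be reproduced for any numeric field with base R plus a small moment-skewness helper. This is a sketch on a simulated right-skewed field; the 0.05 cutoff for isNormal is an assumption consistent with the table.

```r
# #Moment skewness (population form), as a small helper
f_skew <- function(x) {
  m <- mean(x); s <- sqrt(mean((x - m)^2))
  mean((x - m)^3) / s^3
}
#
# #Illustration on a simulated right-skewed field
set.seed(33)
x <- rexp(500, rate = 1)
Skewness  <- round(f_skew(x), 4)        # positive => right-skewed
p_Shapiro <- shapiro.test(x)$p.value    # very small for skewed data
isNormal  <- p_Shapiro > 0.05
c(Mean = mean(x), Median = median(x))   # Mean > Median also signals right skew
```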

43.4 Explore Categorical Variables

43.4.1 International Plan

Contingency Table (CrossTab)

  • 28.4% of churners belong to the International Plan, compared to 6.5% of non-churners.
  • 42.4% of the International Plan holders churned, as compared to only 11.5% of those without the International Plan.
  • The proportion of International Plan holders is greater among the churners.
  • Summary
    • We should investigate what it is about our International Plan that is inducing our customers to leave.
    • We should expect that the model (in future) will probably include whether or not the customer selected the International Plan.
Table 43.2: (C33T02) International vs. Churn
Churn \(\rightarrow\) / \(\downarrow\) International | No | Yes | Row SUM | Row No % | Row Yes % | Col No % | Col Yes % | Col Row SUM %
No 2664 346 3010 88.5% 11.5% 93.5% 71.6% 90.3%
Yes 186 137 323 57.6% 42.4% 6.5% 28.4% 9.7%
Total 2850 483 3333 85.5% 14.5% 100% 100% 100%
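The row and column percentages in Table 43.2 can be reproduced from the raw counts with base R's prop.table() (counts taken from the table above):

```r
# #Counts from Table 43.2: rows = International, columns = Churn
m <- matrix(c(2664, 346,
               186, 137),
            nrow = 2, byrow = TRUE,
            dimnames = list(International = c("No", "Yes"),
                            Churn = c("No", "Yes")))
#
round(100 * prop.table(m, margin = 1), 1)  # row %: 42.4% of plan holders churned
round(100 * prop.table(m, margin = 2), 1)  # col %: 28.4% of churners hold the plan
```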

Bar Charts

  • Bar Charts
    • Grouped
      • These are good for comparing elements within a category and for comparing elements across categories.
    • Stacked
      • They are great for showing the total because they visually aggregate all of the categories in a group.
      • The downside is that it becomes harder to compare the sizes of the individual categories.
      • Stacking also indicates a part to whole relationship.
    • Stacked Percent
      • The total quantity is hidden by using percentages, but it is easier to see the relative difference between quantities in each group.
    • Summary
      • If there is no part to whole relationship (maybe there is overlap in the categories), then use a grouped chart.
      • If there is a part to whole relationship, then the next question to ask is what relationship is the most important to show.
        • If the goal is to show sizes between individual categories, use a grouped column or bar chart.
        • If the goal is to show the total sizes of groups, use a regular stacked bar chart.
        • If the goal is to show relative differences within each group, use a stacked percentage column chart.
ERROR 43.1 Error: stat_count() can only have an x or y aesthetic.
ERROR 43.2 Error: stat_count() must not be used with a y aesthetic.
  • stat_count(): choose the stat based on the shape of the data
    • stat = ‘identity’ tells ggplot2 that you will provide the y-values (i.e. Wide Data)
    • stat = ‘count’ is the default, which implies that ggplot2 will count the aggregate number of rows for each x value. (i.e. Long Data)
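The wide-vs-long distinction behind the two stats can be seen without ggplot2 (the tiny data frames below are hypothetical):

```r
# #Long data: one row per observation -> counting is needed (stat = 'count')
long <- data.frame(Churn = c("No", "No", "Yes", "No", "Yes"))
table(long$Churn)   # ggplot2's default stat_count() does this internally
#
# #Wide data: y-values already supplied -> use them as-is (stat = 'identity')
wide <- data.frame(Churn = c("No", "Yes"), N = c(3, 2))
wide                # geom_bar(stat = 'identity') plots N directly
```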
Image

Figure 43.1 (C33P01 C33P02 C33P03) International Plan holders tend to churn more frequently

Code CrossTab
# #xyw: x (independent), y (dependent), c (column =y), r (row =x)
# #g (grouped), w (wider), l (longer), 
r_xyg <- "International"
c_xyg <- "Churn"
xyg <- bb %>% select(int_l_plan, churn) %>% rename(Predictor = 1, Target = 2) %>% 
  mutate(across(1:2, ~ifelse(., 'Yes', 'No'))) %>% 
  count(Predictor, Target) 
#
str(xyg)
# #IN: xyg (Predictor, Target), r_xyg, c_xyg 
# #Generate Contingency Table (CrossTab)
ctab <- xyg %>% 
  pivot_wider(names_from = Target, values_from = n, values_fill = 0, names_sort = TRUE) %>%
  #mutate(across(1, as.character)) %>% 
  add_row(summarise(., across(1, ~ "Total")), summarise(., across(where(is.numeric), sum))) %>% 
  mutate(SUM = rowSums(across(where(is.numeric)))) %>% 
  mutate(across(where(is.numeric) & !matches("SUM"), 
                list(r = ~ round(. /SUM, 3)), .names = "xx_{.fn}{.col}")) %>% 
  mutate(across(where(is.numeric) & !starts_with("xx_") & !matches("SUM"), 
                list(rp = ~ paste0(round(100 * . /SUM, 1), "%")), .names = "{.fn}{.col}")) %>%
  mutate(across(where(is.numeric) & !starts_with("xx_"), 
                list(c = ~ 2 * ./sum(.)), .names = "yy_{.fn}{.col}")) %>% 
  mutate(across(where(is.numeric) & !starts_with("xx_") & !starts_with("yy_"), 
                list(cp = ~ paste0(round(100 * 2 * . /sum(.), 1), "%")), .names = "{.fn}{.col}"))
Code CrossTab Print
hh <- ctab %>% select(-c(5:6, 9:11)) 
#names_hh <- names(hh)
names_hh <-c(paste0(c_xyg, " ", "$\\rightarrow$", "<br/>", "$\\downarrow$", " ", r_xyg), 
                "No <br/> <br/>", "Yes <br/> <br/>", "Row <br/> SUM", 
                "Row <br/> No %", "Row <br/> Yes %", 
                "Col <br/> No %", "Col <br/> Yes %", "Col <br/> Row SUM %") 
stopifnot(identical(ncol(hh), length(names_hh)))
stopifnot(nrow(hh) < 10)
#
cap_hh <- paste0("(C33T02) ", r_xyg, " vs. ", c_xyg)
#
kbl(hh,
  caption = cap_hh,
  col.names = names_hh,
  escape = FALSE, align = "c", booktabs = TRUE
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                html_font = "Consolas", font_size = 12,
                full_width = FALSE,
                #position = "float_left",
                fixed_thead = TRUE
  ) %>%
# #Header Row Dark & Bold: RGB (48, 48, 48) =HEX (#303030)
    row_spec(0, color = "white", background = "#303030", bold = TRUE,
             extra_css = "border-bottom: 1px solid; border-top: 1px solid"
    ) #%>% row_spec(row = 1:nrow(hh), color = "black")

43.4.2 Voice Mail Plan

Contingency Table (CrossTab)

  • 16.7% of those without the Voice Mail Plan are churners, as compared to 8.7% of customers who do have the Voice Mail Plan.
    • Thus, customers without the Voice Mail Plan are nearly twice as likely to churn as customers with the plan.
  • Summary
    • Perhaps we should enhance our Voice Mail Plan still further, or make it easier for customers to join it, as an instrument for increasing customer loyalty
    • We should expect that the model (in future) will probably include whether or not the customer selected the Voice Mail Plan. Our confidence in this expectation is perhaps not quite as high as for the International Plan.
Table 43.3: (C33T03) VoiceMail vs. Churn
Churn \(\rightarrow\) / \(\downarrow\) VoiceMail | No | Yes | Row SUM | Row No % | Row Yes % | Col No % | Col Yes % | Col Row SUM %
No 2008 403 2411 83.3% 16.7% 70.5% 83.4% 72.3%
Yes 842 80 922 91.3% 8.7% 29.5% 16.6% 27.7%
Total 2850 483 3333 85.5% 14.5% 100% 100% 100%
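The "nearly twice as likely" claim can be checked directly from the counts in Table 43.3:

```r
# #Churn rates from Table 43.3 (rows = Voice Mail Plan)
rate_no_vmail <- 403 / 2411            # 16.7% churn without the plan
rate_vmail    <- 80 / 922              # 8.7% churn with the plan
round(rate_no_vmail / rate_vmail, 2)   # ~1.93, i.e. nearly twice as likely
```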

43.4.3 Three Categorical Variables

  • Two-way interactions among categorical variables with respect to churn.
    • Multilayer clustered bar chart of churn, clustered by both International Plan and Voice Mail Plan.
      • There are many more customers who have neither plan.
      • There are many more customers who have the voice mail plan only than have both plans.
      • More importantly, among customers with no voice mail plan, the proportion of churners is greater for those who do have an international plan (43.7%) than for those who do not (13.9%).
      • Similarly, among customers with the voice mail plan, the proportion of churners is much greater for those who also select the international plan (39.1%) than for those who do not (5.3%).
      • Note also that there is no interaction among the categorical variables. That is, international plan holders have greater churn regardless of whether they are Voice Mail plan adopters or not.
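Churn proportions within each International × Voice Mail cell can be computed with xtabs() and prop.table(). The toy counts below are hypothetical, chosen only to show the mechanics (not the chapter's data).

```r
# #Hypothetical 3-way counts: Intl x Vmail x Churn
dd <- expand.grid(Intl  = c("No", "Yes"),
                  Vmail = c("No", "Yes"),
                  Churn = c("No", "Yes"))
dd$N <- c(180, 20, 90, 10,   # Churn = No  (Intl varies fastest)
           30, 15,  5,  5)   # Churn = Yes (hypothetical)
tab <- xtabs(N ~ Intl + Vmail + Churn, data = dd)
#
# #Churn proportion within each Intl x Vmail cell
round(prop.table(tab, margin = c(1, 2))[, , "Yes"], 3)
```

In this toy table, too, International Plan holders churn more in both Vmail groups, i.e. no interaction in the sense used above.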

Image


Figure 43.2 (C33P04 C33P05) Churn vs. International & Voice Mail Plans (Both are Same Graphs)


Figure 43.3 (C33P06) Churn vs. International & Voice Mail Plans (Proportional Stack)

Code

xsyg <- bb %>% select(int_l_plan, vmail_plan, churn) %>% 
  rename(Intl = 1, Vmail =2, Churn = 3) %>% 
  mutate(across(Churn, ~ifelse(., "Yes", "No"))) %>% 
  mutate(across(Intl, ~ifelse(., "I: Yes", "I: No"))) %>% 
  mutate(across(Vmail, ~ifelse(., "V: Yes", "V: No"))) %>% 
  count(Intl, Vmail, Churn) %>% rename(N = n)
# #Multilayer Clustered Bar Chart
hh <- xsyg
#
cap_hh <- "C33P04"
ttl_hh <- "Churn: Churn vs. International & Voice Mail Plans"
sub_hh <- NULL 
x_hh <- "Churn" #r_xyg
y_hh <- "Frequency"
lgd_hh  <- "Churn" #c_xyg
#
C33 <- hh %>% { ggplot(., aes(x = Churn, y = N, fill = Churn)) + 
    geom_bar(position = "dodge", stat = "identity", alpha = 1) + 
    geom_text(position = position_stack(vjust = 0.5), aes(label = N)) +
    facet_wrap(Intl ~ Vmail, nrow = 1) +
    scale_fill_manual(values = c('#FFEA46FF', '#787877FF')) +
    theme(panel.grid.major.x = element_blank(), axis.line = element_blank(),
          panel.border = element_rect(colour = "black", fill = NA, size = 1),
          legend.position = 'top', 
          legend.box = "horizontal", legend.direction = "horizontal") +
    labs(x = x_hh, y = y_hh, fill = lgd_hh, 
         subtitle = sub_hh, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
# #Multilayer Clustered Bar Chart
hh <- xsyg
#
cap_hh <- "C33P05"
ttl_hh <- "Churn: Churn vs. International & Voice Mail Plans"
sub_hh <- NULL 
x_hh <- NULL #"Churn" #r_xyg
y_hh <- "Frequency"
lgd_hh  <- "Churn" #c_xyg
#
C33 <- hh %>% { ggplot(., aes(x = Vmail, y = N, fill = Churn)) + 
    geom_bar(position = "dodge", stat = "identity", alpha = 1) + 
    geom_text(position = position_dodge(width = 1), aes(label = N), vjust = 1.5) +
    facet_wrap(~Intl, nrow = 1) +
    scale_fill_manual(values = c('#FFEA46FF', '#787877FF')) +
    theme(panel.grid.major.x = element_blank(), axis.line = element_blank(),
          panel.border = element_rect(colour = "black", fill = NA, size = 1),
          legend.position = 'top', 
          legend.box = "horizontal", legend.direction = "horizontal") +
    labs(x = NULL, y = y_hh, fill = lgd_hh, 
         subtitle = sub_hh, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)

Code More

# #Percent Stacked Bar Chart
hh <- xsyg %>% group_by(Vmail, Intl) %>% 
    mutate(Ratio = paste0(round(100 * N/sum(N), 1), "%")) %>% ungroup()
#
cap_hh <- "C33P06"
ttl_hh <- "Churn: Churn vs. International & Voice Mail Plans"
sub_hh <- NULL 
x_hh <- "Voice Mail"
y_hh <- "Grouped Percentage"
lgd_hh  <- "Churn"
#
C33 <- hh %>% { ggplot(., aes(x = Vmail, y = N, fill = Churn)) + 
        geom_bar(position = "fill", stat = 'identity') + 
        facet_wrap(~Intl, nrow = 1) +
        geom_text(position = position_fill(vjust = 0.5), aes(label = Ratio), 
              colour = rep(c("black", "white"), 4)) +
        scale_fill_viridis_d(direction = -1) +  
        scale_y_continuous(labels = percent) +
        theme(panel.grid.major.x = element_blank(), axis.line = element_blank(),
              panel.border = element_rect(colour = "black", fill = NA, size = 1)) +
        labs(x = x_hh, y = y_hh, fill = lgd_hh, 
            subtitle = sub_hh, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)

43.5 Exploring Numeric Variables

  • Refer to Figure 43.4 & Table 43.1
    • Fields not showing evidence of symmetry include voice mail messages and customer service calls.
    • Voice Mail Messages: Median = 0
      • Indicating that at least half of all customers had no voice mail messages.
      • This is because fewer than half (27.7%) of the customers select the Voice Mail Plan.
    • Customer Service Calls: Mean = 1.56 > Median = 1
      • It indicates some right-skewness
      • Also indicated by the maximum number of customer service calls (9).

Figure 43.4 (B18P03) Churn: All Histograms

43.5.1 Service Calls

  • Overlay Histogram
    • To explore whether a predictor is useful for predicting the target variable
    • Bars are colored according to the values of the target variable.
    • Normalized Histogram
      • Proportions stretched out to enable better contrast
        • Normalized histograms are useful for teasing out the relationship between a numerical predictor and the target. However, data analysts should always provide a companion non-normalized histogram along with the normalized one, because the normalized histogram carries no information about the frequency distribution of the variable.
        • Ex: The churn rate for customers logging nine service calls is 100%; but there are only two customers with this number of calls.
      • Customers who have called customer service three times or less have a markedly lower churn rate than customers who have called customer service four or more times.
  • Summary
    • By the third service call, specialized incentives should be offered to retain customer loyalty, because, by the fourth call, the probability of churn increases greatly
    • We should expect that the model (in future) will probably include the number of customer service calls made by the customer.
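The need to pair the normalized view with raw counts can be sketched with hypothetical counts (the numbers below are illustrative, not from the data set): report the churn rate alongside the cell count, so a high rate on a tiny cell is not over-read.

```r
# #Hypothetical counts by number of customer service calls
n_calls <- 0:4
n_cust  <- c(690, 1020, 580, 360, 75)   # frequency (the non-normalized view)
n_churn <- c( 90,  120,  80,  60, 35)
rate    <- round(100 * n_churn / n_cust, 1)
data.frame(n_calls, n_cust, rate)       # report rate *and* count together
```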

Image


Figure 43.5 (C33P08 C33P09 C33P10) Service Calls beyond 3 have a significant increase in Churn


Figure 43.6 (C33P11) Churn vs. Service Call Proportions (Bar)

Code

ii <- bb %>% select(custserv_calls, churn) %>% 
  rename(Churn = 2) %>% 
  mutate(across(Churn, ~ifelse(., "Yes", "No")))
# #Histogram Default (Not useful for association of predictor and target)
hh <- ii
ttl_hh <- "Churn: Customer Service Calls"
cap_hh <- "C33P08"
sub_hh <- "Predictor Only" 
x_hh <- "x"
y_hh <- "Frequency"
lgd_hh  <- "Churn"
#
C33 <- hh %>% { ggplot(data = ., mapping = aes(x = custserv_calls, fill = '#FDE725FF')) + 
    geom_histogram(bins = length(unique(.[[1]])), alpha = 1) + 
    #stat_bin(bins = length(unique(.[[1]])), aes(y=..count.., label=..count..), 
    #         geom="text", position=position_stack(vjust=0.5)) +
    scale_x_continuous(breaks = breaks_pretty()) + 
    scale_fill_viridis_d(direction = -1) +
    theme(plot.title.position = "panel", 
          axis.title.x = element_blank(), 
          #legend.position = c(0.5, -0.08), legend.direction = 'horizontal', 
          legend.position = 'none') +
    labs(x = x_hh, y = y_hh, #fill = lgd_hh,
         caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
# #Histogram (Predictor and Target Count)
hh <- ii
ttl_hh <- "Churn: Customer Service Calls & Churn"
cap_hh <- "C33P09"
sub_hh <- "Count of Predictor & Target" 
x_hh <- "x"
y_hh <- "Frequency"
lgd_hh  <- "Churn"
#
C33 <- hh %>% { ggplot(data = ., mapping = aes(x = custserv_calls, fill = Churn)) + 
    geom_histogram(bins = length(unique(.[[1]])), alpha = 1) + 
    #stat_bin(bins = length(unique(.[[1]])), aes(y=..count.., label=..count..), 
    #         geom="text", position=position_stack(vjust=0.5)) +
    scale_x_continuous(breaks = breaks_pretty()) + 
    scale_fill_viridis_d(direction = -1) +
    theme(plot.title.position = "panel", 
          axis.title.x = element_blank(), 
          legend.position = c(0.5, -0.07), legend.direction = 'horizontal') +
    labs(x = x_hh, y = y_hh, fill = lgd_hh,
         caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
# #Histogram (Predictor and Target Proportion)
hh <- ii
ttl_hh <- "Churn: Customer Service Calls & Churn"
cap_hh <- "C33P10"
sub_hh <- "Proportion of Predictor & Target (Using Histogram)" 
x_hh <- "x"
y_hh <- "Grouped Percentage"
lgd_hh  <- "Churn"
#
# #NOTES: Group wise Proportion i.e. Yes 1, No 1 (Not Each Bin wise)
# #Adding Density automatically converts the y-axis to 'frequency density' not to 'percentage'
# #Both will match when Bin Width =1 but otherwise there will be a mismatch
# #Using y=..density.. scales the histograms so the area under each is 1, or sum(binwidth*y)=1. 
# #So use y = binwidth *..density.. to have y represent the fraction of the total in each bin. 
# #OR aes(y = stat(width*density))
# #OR aes(y = stat(count / sum(count)))
# #Clarification
# # ..count../sum(..count..) each count is divided by the total count
# # ..density.. it is applied to each group independently
#
C33 <- hh %>% { ggplot(data = ., mapping = aes(x = custserv_calls, fill = Churn)) + 
    geom_histogram(bins = length(unique(.[[1]])), alpha = 1, position = 'fill') + 
    #stat_bin(bins = length(unique(.[[1]])), 
    #         aes(y = c(..count..[..group..==1]/sum(..count..[..group..==1]),
    #                   ..count..[..group..==2]/sum(..count..[..group..==2])), 
    #             label=round(..density.., 2)), geom="text") +
    #stat_bin(bins = length(unique(.[[1]])), 
    #         aes(y=..density.., group = ..group.., 
    #             label=round(..density.., 2)), geom="text") +
    scale_x_continuous(breaks = breaks_pretty()) + 
        #Deprecated : percent_format()
        scale_y_continuous(labels = percent) +
    scale_fill_viridis_d(direction = -1) +
    theme(plot.title.position = "panel", 
          axis.title.x = element_blank(), 
          legend.position = c(0.5, -0.07), legend.direction = 'horizontal') +
    labs(x = x_hh, y = y_hh, fill = lgd_hh,
         caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)

Code More

# #complete() can be used to add the missing combination but not using it for now
# #NOTE: Bar is easier to include Labels, However Histogram is easier for large number of bins
xyg  <- ii %>% count(custserv_calls, Churn) 
     #%>% complete(custserv_calls, Churn, fill = list(n = 0)) 
hh <- xyg %>% rename(Group = 1, SubGroup = 2, N = 3) %>% 
  group_by(Group) %>% 
  mutate(Ratio = paste0(round(100 * N/sum(N), 1), "%")) %>% ungroup()
#
ttl_hh <- "Churn: Customer Service Calls & Churn"
cap_hh <- "C33P11"
sub_hh <- "Proportion of Predictor & Target (Using Bar)" 
x_hh <- "x"
y_hh <- "Grouped Percentage"
lgd_hh  <- "Churn"
#
C33 <- hh %>% { ggplot(., aes(x = Group, y = N, fill = SubGroup)) + 
    geom_bar(position = "fill", stat = 'identity') + 
    geom_text(position = position_fill(vjust = 0.5), 
              aes(label = ifelse(SubGroup == 'No', "", Ratio)), 
              colour = c(rep(c("black", "white"), 9), "white")) +
    scale_fill_viridis_d(direction = -1) +
    scale_x_continuous(breaks = breaks_pretty()) +
    scale_y_continuous(labels = percent) +
    theme(plot.title.position = "panel", axis.title.x = element_blank(), 
          panel.grid.major.x = element_blank(), axis.line = element_blank(),
          panel.border = element_rect(colour = "black", fill = NA, size = 1),
          legend.position = c(0.5, -0.07), legend.direction = 'horizontal') +
    labs(x = x_hh, y = y_hh, fill = lgd_hh, 
         subtitle = sub_hh, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)

43.5.2 Day Minutes

  • Summary: Day Minutes
    • High day-users tend to churn at a higher rate.
    • As the number of day minutes passes 200, we should consider special incentives
    • We should investigate why heavy day-users are tempted to leave
    • We should expect that the model (in future) will probably include ‘day minutes.’

Figure 43.7 (C33P12 C33P13) Higher Day Minutes (>200) have higher Churn

43.5.3 Evening Minutes

  • Summary : Evening Minutes
    • It shows a slight tendency for customers with higher evening minutes to churn.
    • However, it is inconclusive solely based on graph.

Figure 43.8 (C33P14 C33P15) (Inconclusive) Slight tendency to Churn with higher Evening Minutes

43.5.4 Night Minutes

  • Summary : Night Minutes
    • There is no obvious association visible (in graphs) between churn and night minutes
    • The lack of obvious association at the EDA stage between a predictor and a target variable is not sufficient reason to omit that predictor from the model.
    • Unless there is a good reason for eliminating the variable before modeling, we should keep them and allow the modeling process to identify which variables are predictive and which are not.

Figure 43.9 (C33P16 C33P17) No obvious association between Churn and Night Minutes

43.5.5 International Calls

  • Summary : International Calls
    • There is no obvious association visible (in graphs) between churn and International Calls
    • However, a t-test for the difference in the mean number of international calls between churners and non-churners is statistically significant, i.e., the means are different.
    • NOTE: Hypothesis Testing is NOT part of EDA. It is mentioned here to show that statistically significant association may exist without being obviously visible.

Image


Figure 43.10 (C33P20 C33P21) No obvious association between Churn and International Calls (But t-test)

t-test

# #t-test for difference in Mean of Target Values
str(ii)
## tibble [3,333 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Predictor: int [1:3333] 3 3 5 7 3 6 7 6 4 5 ...
##  $ Target   : chr [1:3333] "No" "No" "No" "No" ...
# 
# #Variance
var_ii <- var.test(formula = Predictor ~ Target, data = ii)
var_ii
## 
##  F test to compare two variances
## 
## data:  Predictor by Target
## F = 0.91594, num df = 2849, denom df = 482, p-value = 0.1971
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.7962798 1.0465337
## sample estimates:
## ratio of variances 
##          0.9159437
#
isVarEqual <- var_ii$p.value > 0.05
if(isVarEqual) print("Variances are Equal.") else print("Variances are Different.")
## [1] "Variances are Equal."
#
# #t-test: Welch 
ha_ii <- "two.sided" #"two.sided", "less", "greater"
#tt_ii <- t.test(Predictor ~ Target, data = ii, alternative = ha_ii, var.equal = isVarEqual)
tt_ii <- t.test(Predictor ~ Target, data = ii, alternative = ha_ii, var.equal = FALSE)
tt_ii
## 
##  Welch Two Sample t-test
## 
## data:  Predictor by Target
## t = 2.9604, df = 640.64, p-value = 0.003186
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  0.1243807 0.6144620
## sample estimates:
##  mean in group No mean in group Yes 
##          4.532982          4.163561
#
alpha <- 0.05
# #t.test() already returns a p-value matched to the chosen alternative,
# #so it is compared with alpha directly (no alpha/2 adjustment needed)
if(tt_ii$p.value >= alpha) {
    print("Failed to reject H0.")
} else { 
    print("H0 Rejected.")
}
## [1] "H0 Rejected."

43.5.6 All Histograms

Images


Figure 43.11 (C33P18) Histograms of All Predictors with Target (Count)


Figure 43.12 (C33P19) Histograms of All Predictors with Target (Proportion)

Code

# #All Continuous Predictors and Target (Categorical)
xsy <- bb %>% 
  select(where(is.numeric) | "churn") %>% 
  select(!area_code) %>% 
  relocate(ends_with("_mins")) %>% 
  relocate(ends_with("_calls")) %>% 
  relocate(vmail_message, .after =  last_col()) %>% 
  relocate("churn") %>% rename("Target" = 1) %>% 
  mutate(across(Target, ~ifelse(., "Yes", "No"))) %>% 
  mutate(across(Target, factor, levels = unique(Target)))
#
xsyl <- xsy %>% pivot_longer(where(is.numeric), names_to = "Predictors", values_to = "Values") %>% 
  mutate(across(Predictors, ~ factor(., levels = unique(Predictors))))
#
#str(ii)
# #Histogram
hh <- xsyl
ttl_hh <- "Churn: Histograms of All Predictors with Target (Count)"
cap_hh <- "C33P18"
sub_hh <- NULL #"Count of Predictor & Target" 
x_hh <- NULL # "x"
y_hh <- NULL # "Frequency" #"Grouped Percentage"
lgd_hh  <- "Churn"
#
C33 <- hh %>% { ggplot(data = ., mapping = aes(x = Values, fill = Target)) + 
    geom_histogram(alpha = 1, boundary = 0, #position = position_stack(reverse = TRUE),
        bins = ifelse(length(unique(.$Predictors)) > 50, 50, length(unique(.$Predictors)))) + 
    #geom_histogram(alpha = 1, boundary = 0,
    #    bins = ifelse(nrow(distinct(.[2])) > 50, 50, nrow(distinct(.[2])))) + 
    facet_wrap(~Predictors, nrow = 3, scales = 'free') +
    scale_x_continuous(breaks = breaks_pretty()) + 
    scale_fill_viridis_d(direction = -1) +
    theme(plot.title.position = "panel", 
          strip.text.x = element_text(size = 10, colour = "white"), 
          axis.title.x = element_blank(), 
          legend.position = c(0.5, -0.07), legend.direction = 'horizontal') +
    labs(x = x_hh, y = y_hh, fill = lgd_hh, 
         caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
# #Histogram
hh <- xsyl
ttl_hh <- "Churn: Histograms of All Predictors with Target (Proportion)"
cap_hh <- "C33P19"
sub_hh <- NULL #"Count of Predictor & Target" 
x_hh <- NULL # "x"
y_hh <- NULL # "Frequency" #"Grouped Percentage"
lgd_hh  <- "Churn"
#
# #Caution: Warning on Removal of Missing Values has been removed
# #NOTE: position_fill() normalizes the Bars and is same as 'fill'
# # 
C33 <- hh %>% { ggplot(data = ., mapping = aes(x = Values, fill = Target)) + 
    geom_histogram(alpha = 1, boundary = 0, position = position_fill(), na.rm = TRUE, 
        bins = ifelse(length(unique(.$Predictors)) > 50, 50, length(unique(.$Predictors)))) + 
    #geom_histogram(alpha = 1, boundary = 0,
    #    bins = ifelse(nrow(distinct(.[2])) > 50, 50, nrow(distinct(.[2])))) + 
    facet_wrap(~Predictors, nrow = 3, scales = 'free_x') +
    scale_x_continuous(breaks = breaks_pretty()) + 
    scale_fill_viridis_d(direction = -1) +
    theme(plot.title.position = "panel", 
          strip.text.x = element_text(size = 10, colour = "white"), 
          axis.title.x = element_blank(), 
          legend.position = c(0.5, -0.07), legend.direction = 'horizontal') +
    labs(x = x_hh, y = y_hh, fill = lgd_hh, 
         caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)

43.6 Exploring Multivariate Variables: Scatter Plots

43.6.1 Day Minutes & Evening Minutes

  • Summary:
    • Customers with both high day minutes and high evening minutes appear to have a higher proportion of churners than the records below the line.

Image


Figure 43.13 (C33P22) Scatterplot of Evening Minutes (X) and Day Minutes (Y) shows a clear separation for Churn at High X and High Y

Code

ii <- bb %>% select(day_mins, eve_mins, churn)  %>% 
  rename(Target = churn) %>% 
  mutate(across(Target, ~ifelse(., "Yes", "No")))
hh <- ii
ttl_hh <- "Churn: Scatterplot of Evening Minutes and Day Minutes"
cap_hh <- "C33P22"
sub_hh <- NULL #subtitle = TeX(r"(Trendline Equation, $R^{2}$, $\bar{x}$ and $\bar{y}$)")
x_hh <- "Evening Minutes" # "x"
y_hh <- "Day Minutes" # "Frequency" #"Grouped Percentage"
lgd_hh  <- "Churn"
#
C33 <- hh %>% { ggplot(data = ., aes(x = eve_mins, y = day_mins)) + 
    #geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
    #geom_point(position = "jitter", aes(colour = Target)) +
    geom_jitter(aes(colour = Target)) +
        scale_colour_viridis_d(alpha = 0.9, direction = -1) +
    theme(panel.grid.minor = element_blank(),
          panel.border = element_blank()) +
    labs(x = x_hh, y = y_hh, colour = lgd_hh,
         caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)

43.6.2 Day Minutes & Customer Service Calls

  • Summary:
    • It indicates a high-churn area in the upper left section of the graph. These records represent customers who have a combination of a high number of customer service calls and a low number of day minutes used.
    • Note that this group of customers could not have been identified had we restricted ourselves to univariate exploration (exploring variable by single variable). This is because of the interaction between the variables.
    • In general, customers with higher numbers of customer service calls tend to churn at a higher rate, as we learned earlier in the univariate analysis. However, of these customers with high numbers of customer service calls, those who also have high day minutes are somewhat “protected” from this high churn rate. The customers in the upper right of the scatter plot exhibit a lower churn rate than those in the upper left.

Image


Figure 43.14 (C33P27) Scatterplot of Day Minutes (X) and Customer Service Calls (Y) shows an interaction effect for Churn

Code

ii <- bb %>% select(day_mins, custserv_calls, churn)  %>% 
  rename(Target = churn) %>% 
  mutate(across(Target, ~ifelse(., "Yes", "No")))
hh <- ii
ttl_hh <- "Churn: Day Minutes and Customer Service Calls"
cap_hh <- "C33P27"
sub_hh <- NULL
x_hh <- "Day Minutes"
y_hh <- "Customer Service Calls"
lgd_hh  <- "Churn"
#
C33 <- hh %>% { ggplot(data = ., aes(x = day_mins, y = custserv_calls)) + 
    geom_jitter(aes(colour = Target), width = 0.1, height = 0.1) +
    scale_colour_viridis_d(alpha = 0.9, direction = -1) +
    scale_y_continuous(breaks = breaks_pretty()) + 
    theme(panel.grid.minor = element_blank(),
          panel.border = element_blank()) +
    labs(x = x_hh, y = y_hh, colour = lgd_hh,
         caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)

43.6.3 package:GGally

SPLOM

  • Scatter Plot of Matrices (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal.

Figure 43.15 (C33P23) Bivariate Analysis of Calls showing Churn


Figure 43.16 (C33P24) Bivariate Analysis of Minutes & Charges showing Churn

GGally Bivariate

# #Assumes Column 1 has Target Variable (Factor)
C33 <- hh %>% { 
  ggpairs(data = ., mapping = aes(colour = Target, fill = Target, alpha = 0.3), 
          columns = 2:ncol(.), 
          lower = list(continuous = f_gg_scatter),
          diag = list(continuous = f_gg_density)) +
    labs(caption = cap_hh, subtitle = sub_hh, title = ttl_hh) 
}
assign(cap_hh, C33)
rm(C33)

GGally Manual

# #For GGally Manual Functions
f_gg_scatter <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_jitter(...) +
    scale_colour_viridis_d(direction = -1)
}

f_gg_density  <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_density(...) +
    scale_fill_viridis_d(direction = -1) + 
      scale_colour_viridis_d(direction = -1) 
}

43.6.4 package:psych

SPLOM


Figure 43.17 (C33P25) Bivariate Analysis of Calls showing Churn


Figure 43.18 (C33P26) Bivariate Analysis of Minutes & Charges showing Churn

pairs.panels()

# #IN: hh, cap_hh, ttl_hh, loc_png
if(!file.exists(loc_png)) {
  png(filename = loc_png, width = k_width, height = k_height, units = "in", res = 144) 
  pairs.panels(hh[2:ncol(hh)], smooth = FALSE, jiggle = TRUE, rug = FALSE, ellipses = FALSE, 
               bg = rev(viridis(2))[hh$Target], pch = 21, lwd = 1, cex.cor = 1, cex = 1, 
               gap = 0, main = ttl_hh)
  #title(main = ttl_hh, line = 2, adj = 0)
  title(sub = cap_hh, line = 4, adj = 1)
  C33 <- recordPlot()
  dev.off()
  assign(cap_hh, C33)
  rm(C33)
}

43.7 Subset Interesting Datapoints

  • Those which were identified by EDA
  • Anomalous fields like Area Code, which has only 3 values yet is distributed over all the States

43.8 Binning

  • Figure 43.6 showed that customers with less than four calls to customer service had a lower churn rate than customers who had four or more calls to customer service.
    • We may therefore decide to bin the customer service calls variable into two classes, low (fewer than four) and high (four or more).

Figure 43.19 (C33P28 C33P29 C33P30) High Customer Service Calls have major Churn

Table 43.4: (C33T04) Service Calls vs. Churn
Service Calls \(\downarrow\) \ Churn \(\rightarrow\) | No | Yes | Row SUM | Row No % | Row Yes % | Col No % | Col Yes % | Row SUM %
Low | 2721 | 345 | 3066 | 88.7% | 11.3% | 95.5% | 71.4% | 92%
High | 129 | 138 | 267 | 48.3% | 51.7% | 4.5% | 28.6% | 8%
Total | 2850 | 483 | 3333 | 85.5% | 14.5% | 100% | 100% | 100%
  • Similarly, Evening Minutes can be given flags (Categorical)
    • Low (<= 160), Medium (> 160 and <= 240), High (> 240)
    • Recall that the baseline churn rate for all customers is 14.49%. The medium group comes in very close to this baseline rate, 14.1%.
    • However, the High evening minutes group has nearly double the churn proportion (19.5%) compared to the low evening minutes group (10%).
    • The chi-square test is significant, meaning that these results are most likely real and not due to chance alone. “ForLater”
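
The binning and the chi-square test above can be sketched as follows. The `calls` vector is a synthetic stand-in for `bb$custserv_calls`; the contingency counts are taken from Table 43.4.

```r
# #A minimal sketch of the two-class binning with cut()
# #"calls" is synthetic; in the book this would be bb$custserv_calls
calls <- c(0, 1, 2, 3, 4, 5, 6)
bin <- cut(calls, breaks = c(-Inf, 3, Inf), labels = c("Low", "High"))
table(bin)   # Low: 4 (calls 0-3), High: 3 (calls 4-6)

# #Chi-square test on the published counts of Table 43.4 (rows: Low/High, cols: No/Yes)
tbl <- matrix(c(2721, 345,
                129,  138),
              nrow = 2, byrow = TRUE,
              dimnames = list(Calls = c("Low", "High"), Churn = c("No", "Yes")))
chisq.test(tbl)   # p-value far below 0.05: the churn difference is not chance
```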

43.9 Correlation

  • The threshold for significance of the correlation coefficient \(r\) depends on the sample size. In data mining, where there are a large number of records (over 1000), even small values of \(r\), such as −0.1 ≤ r ≤ 0.1, may be statistically significant.
    • One should take care to avoid feeding correlated variables to data mining and statistical models.
    • At best, using correlated variables will overemphasize one data component; at worst, using correlated variables will cause the model to become unstable and deliver unreliable results.
    • However, just because two variables are correlated does not mean that we should omit one of them.
      • Identify any variables that are perfectly correlated (i.e., r = 1.0 or r = −1.0). Do not retain both variables in the model, but rather omit one.
      • Identify groups of variables that are correlated with each other. Then, later, during the modeling phase, apply dimension-reduction methods, such as PCA, to these variables.
    • Example: Charge is perfectly correlated with Minutes (for each of Day, Eve, Night, International), so it can be eliminated. The number of predictors is thus reduced from 20 to 16.
    • The correlation coefficient of 0.038 between account length and day calls has a small p-value of 0.026, telling us that account length and day calls are positively correlated. We should note this, and prepare to apply PCA during the modeling phase.
    • “ForLater” - Get Pearson Correlation Coefficient
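
The sample-size point can be illustrated with a small simulation (the data here are synthetic): a weak correlation of about 0.1 is still highly significant when n is in the thousands.

```r
# #With a large number of records, even a weak correlation is significant
set.seed(1)
n <- 3000
x <- rnorm(n)
y <- 0.1 * x + rnorm(n)   # weak true relationship
ct <- cor.test(x, y)
ct$estimate               # r is small (around 0.1) ...
ct$p.value                # ... yet the p-value is tiny because n is large
```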

ALL Continuous Variables

# #All Continuous Predictors and Target (Categorical)
# #Select Relevant | Binning | Rename & Relocate | Relabel & Factor Target|
xsy <- bb %>% select( (where(is.numeric) & ! "area_code") | "churn") %>% 
  relocate(ends_with("_mins")) %>% 
  relocate(ends_with("_calls")) %>% 
  relocate(vmail_message, .after =  last_col()) %>% 
  relocate(Target = "churn") %>% 
  mutate(across(Target, ~ifelse(., "Yes", "No"))) %>% 
  mutate(across(Target, factor, levels = unique(Target)))
# #
xsyl <- xsy %>% pivot_longer(where(is.numeric), names_to = "Predictors", values_to = "Values") %>% 
  mutate(across(Predictors, ~ factor(., levels = unique(Predictors))))
# #
# #Select All Minutes and All Charges (8)
hh <- xsy %>% select(1, 7:10, 12:15) 
#
ttl_hh <- "GGally: Churn: SPLOM: Minutes and Charges (8)"
cap_hh <- "C33P31"
sub_hh <- NULL #"Count of Predictor & Target" 
lgd_hh  <- "Churn"

GGally

# #Assumes Column 1 has Target Variable (Factor)
C33 <- hh %>% { 
  ggpairs(data = ., mapping = aes(colour = Target, fill = Target, alpha = 0.3), 
          columns = 2:ncol(.), 
          lower = list(continuous = f_gg_scatter),
          diag = list(continuous = f_gg_density)) +
    labs(caption = cap_hh, subtitle = sub_hh, title = ttl_hh) 
}
assign(cap_hh, C33)
rm(C33)

Psych

# #IN: hh, cap_hh, ttl_hh, loc_png
if(!file.exists(loc_png)) {
  png(filename = loc_png, width = k_width, height = k_height, units = "in", res = 144) 
  pairs.panels(hh[2:ncol(hh)], smooth = FALSE, jiggle = TRUE, rug = FALSE, ellipses = FALSE, 
               bg = rev(viridis(2))[hh$Target], pch = 21, lwd = 1, cex.cor = 1, cex = 1, 
               gap = 0, main = ttl_hh)
  #title(main = ttl_hh, line = 2, adj = 0)
  title(sub = cap_hh, line = 4, adj = 1)
  C33 <- recordPlot()
  dev.off()
  assign(cap_hh, C33)
  rm(C33)
}

chart.Correlation

# #For Reference Only. Only add Histogram at the Diagonal to the Base pairs()
PerformanceAnalytics::chart.Correlation(hh[2:ncol(hh)], histogram = TRUE)

corrplot() vs. corPlot()


Figure 43.20 (C33P31) corrplot::corrplot vs. psych::corPlot()

43.10 Insights

  • The four ‘charge’ fields are linear functions of the ‘minute’ fields, and should be omitted.
  • The ‘area code’ field and/or the ‘state’ field are anomalous, and should be omitted.
  • Customers with the International Plan tend to churn more frequently.
  • Customers with the Voice Mail Plan tend to churn less frequently.
  • Customers with four or more Customer Service Calls churn more than four times as often as the other customers.
  • Customers with both high Day Minutes and high Evening Minutes tend to churn at a higher rate (6 times) than the other customers.
  • Customers with low Day Minutes and high Customer Service Calls churn at a higher rate than the other customers.
  • Customers with lower numbers of International Calls churn at a higher rate than customers with more international calls.

Validation


44 Dimension Reduction

44.1 Dimensions

10 or 11 Dimensions are enough for the Universe. How many are needed for your data?

Multicollinearity

Definition 44.1 Multicollinearity is a condition where some of the predictor variables are strongly correlated with each other.
  • Problems: Multicollinearity
    • Multicollinearity leads to instability in the solution space, with possibly incoherent results. For example, in multiple regression, a multicollinear set of predictors can result in a regression that is significant overall even when none of the individual variables is significant.
    • Even if such instability is avoided, inclusion of variables which are highly correlated tends to overemphasize a particular component of the model, as the component is essentially being double counted.
  • Problems: Too many variables
    • The sample size needed to fit a multivariate function grows exponentially with the number of variables.
    • The use of too many predictor variables to model a relationship with a response variable can unnecessarily complicate the interpretation of the analysis, and violates the principle of parsimony
      • i.e. keep the number of predictors to such a size that would be easily interpreted.
    • Also, retaining too many variables may lead to overfitting
      • i.e. generality of the findings is hindered because new data do not behave the same as the training data for all the variables.

Parsimony

Definition 44.2 Principle of parsimony is the problem-solving principle that “entities should not be multiplied beyond necessity.”
  • It is inaccurately paraphrased as “the simplest explanation is usually the best one.”
  • It advocates that when presented with competing hypotheses about the same prediction, one should select the solution with the fewest assumptions, and that this is not meant to be a way of choosing between hypotheses that make different predictions.

Overfitting & Underfitting

Definition 44.3 Overfitting is the production of an analysis that corresponds too closely to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.
Definition 44.4 Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data.
  • An overfitted model is a statistical model that contains more parameters than can be justified by the data.
    • The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented underlying model structure.
    • over-fitting occurs when a model begins to “memorize” training data rather than “learning” to generalize from a trend.
  • Under-fitting would occur, for example, when fitting a linear model to non-linear data. Such a model will tend to have poor predictive performance.
  • Overfitting Example
    • If the number of parameters is the same as or greater than the number of observations, then a model can perfectly predict the training data simply by memorizing the data in its entirety. Such a model, though, will typically fail severely when making predictions.
    • A noisy linear dataset can be fitted by a polynomial function also which would give a perfect fit. However, the linear function (in this case) would be better in extrapolating beyond the fitted data.
  • To decrease the chance or amount of overfitting
    • Use model comparison, cross-validation, regularization, early stopping, pruning, Bayesian priors, or dropout
    • These techniques either explicitly penalize overly complex models, or evaluate model performance on a set of data not used for training, which is assumed to approximate the typical unseen data that a model will encounter.
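
The noisy-linear-data example above can be sketched in a few lines (all data synthetic): a degree-9 polynomial always fits the training points at least as closely as a straight line, but tends to track the noise rather than the signal.

```r
# #Overfitting sketch: noisy linear data, straight line vs degree-9 polynomial
set.seed(7)
x  <- seq(0, 1, length.out = 30)
y  <- 2 * x + rnorm(30, sd = 0.3)     # noisy linear data
fit1 <- lm(y ~ x)                     # parsimonious linear model
fit9 <- lm(y ~ poly(x, 9))            # flexible polynomial

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
# #The polynomial always fits the training data at least as closely ...
rmse(y, fitted(fit9)) <= rmse(y, fitted(fit1))   # TRUE
# #... but measured against the true signal on a fresh grid it tends to do worse
xt <- seq(0, 1, length.out = 100)
rmse(2 * xt, predict(fit1, data.frame(x = xt)))
rmse(2 * xt, predict(fit9, data.frame(x = xt)))
```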

44.2 Dimension-reduction methods

  • These use the correlation structure among the predictor variables to accomplish the following:
    • To reduce the number of predictor items.
    • To help ensure that these predictor items are independent.
    • To provide a framework for interpretability of the results.
  • Dimension-reduction methods:
    • PCA
    • Factor analysis
    • User-defined composites

44.3 PCA

Definition 44.5 Principal components analysis (PCA) seeks to explain the correlation structure of a set of predictor variables \({m}\), using a smaller set of linear combinations of these variables, called components \({k}\). PCA acts solely on the predictor variables, and ignores the target variable.
  • Suppose that the original variables \({\{X_1, X_2, \ldots, X_m\}}\) form a coordinate system in m-dimensional space.
    • Let each variable \({X_i}\) represent an \({n \times 1}\) vector, where \({n}\) is number of records.
    • The principal components represent a new coordinate system, found by rotating the original system along the directions of maximum variability.
  • Analysis
    • Standardize the data, so that the mean for each variable is zero, and the standard deviation is one.
    • \({X_i \to Z_i = \frac{X_i - {\mu}_i}{{\sigma}_{ii}}}\)
    • The covariance is a measure of the degree to which two variables vary together.
    • A positive covariance indicates that, when one variable increases, the other tends to increase, while a negative covariance indicates that, when one variable increases, the other tends to decrease.
    • \({{\sigma}_{ii}^2}\) denotes the variance of \({X_i}\).
      • If \({X_i}\) and \({X_j}\) are independent, then \({{\sigma}_{ij}^2 = 0}\); but reverse may not be TRUE i.e. \({{\sigma}_{ij}^2 = 0}\) does not imply that \({X_i}\) and \({X_j}\) are independent.
      • Note that the covariance measure is not scaled, so that changing the units of measure would change the value of the covariance.
      • The correlation coefficient \({r_{ij}} = \frac{{\sigma}_{ij}^2}{{\sigma}_{ii}{\sigma}_{jj}}\) avoids this difficulty by scaling the covariance by each of the standard deviations.
      • Then, the correlation matrix is denoted as \({\rho}\)
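
The "covariance is not scaled" point is easy to verify on synthetic data: changing the units of one variable rescales the covariance but leaves the correlation coefficient untouched.

```r
# #Covariance depends on units; correlation does not (synthetic data)
set.seed(2)
x <- rnorm(50, mean = 170, sd = 10)   # say, a measurement in centimetres
y <- 0.5 * x + rnorm(50, sd = 5)
all.equal(cov(x / 100, y), cov(x, y) / 100)   # covariance shrinks with the unit change
all.equal(cor(x / 100, y), cor(x, y))         # correlation is unchanged
```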
Definition 44.6 Let \(\mathbf{B}\) be an \(m \times m\) matrix, and let \(\mathbf{I}\) be the \(m \times m\) identity matrix. Then the scalars \(\{\lambda_1, \lambda_2, \ldots, \lambda_m\}\) are said to be the eigenvalues of \(\mathbf{B}\) if they satisfy \(|\mathbf{B} - \lambda \mathbf{I}| = 0\), where \(|\mathbf{Q}|\) denotes the determinant of Q.
Definition 44.7 Let \(\mathbf{B}\) be an \(m \times m\) matrix, and let \({\lambda}\) be an eigenvalue of \(\mathbf{B}\). Then the nonzero \(m \times 1\) vector \(\overrightarrow{e}\) is said to be an eigenvector of \(\mathbf{B}\), if \(\mathbf{B} \overrightarrow{e} = \lambda\overrightarrow{e}\).
  • The total variability in the standardized set of predictors equals the sum of the variances of the Z-vectors, which equals the sum of the variances of the components, which equals the sum of the eigenvalues, which equals the number of predictors
    • i.e. \(\sum_{i=1}^m {\text{Var}({Y_i})} = \sum_{i=1}^m {\text{Var}({Z_i})} = \sum_{i=1}^m {\lambda_i} = m\)
  • The partial correlation between a given component and a given predictor variable is a function of an eigenvector and an eigenvalue. Specifically, \(\text{Corr}(Y_i, Z_j) = e_{ij}\sqrt{\lambda_i}\), where \(\{ (\lambda_1, e_1), (\lambda_2, e_2), \ldots, (\lambda_m, e_m)\}\) are the eigenvalue-eigenvector pairs for the correlation matrix \(\rho\), and we note that \(\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_m\). In other words, the eigenvalues are ordered by size. (A partial correlation coefficient is a correlation coefficient that takes into account the effect of all the other variables.)
  • The proportion of the total variability in Z that is explained by the \(i^{\text{th}}\) principal component is the ratio of the \(i^{\text{th}}\) eigenvalue to the number of variables, that is, the ratio \(\frac{\lambda_i}{m}\).
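
These identities can be checked numerically with `prcomp()` on standardized synthetic data: the eigenvalues \(\lambda_i = s_i^2\) sum to \(m\), and \(\text{Corr}(Y_i, Z_j) = e_{ij}\sqrt{\lambda_i}\) holds entry by entry.

```r
# #Numerical check of the PCA identities on standardized synthetic data
set.seed(4)
Z <- scale(matrix(rnorm(200 * 3), ncol = 3) %*% matrix(runif(9), nrow = 3))
p <- prcomp(Z)
lambda <- p$sdev^2                      # eigenvalues of the correlation matrix
all.equal(sum(lambda), ncol(Z))         # sum of eigenvalues equals m (= 3)
# #Corr between component i and variable j equals e_ij * sqrt(lambda_i)
i <- 1; j <- 2
all.equal(cor(p$x[, i], Z[, j]),
          p$rotation[j, i] * sqrt(lambda[i]))
```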

44.4 Data Housing

Please import the "C34-cadata.txt"

  • Source: http://lib.stat.cmu.edu/datasets/houses.zip
  • About: [20640, 9]
    • It provides census information from all the block groups from the 1990 California census.
    • For this data set, a block group has an average of 1425.5 people living in an area that is geographically compact.
    • Block groups were excluded that contained zero entries for any of the variables.
    • Variables: median house value (Target), median income, housing median age, total rooms, total bedrooms, population, households, latitude, and longitude.

44.5 Partition Data

Train & Test

set.seed(3)  #wwww
# #sample() and its variants can be used for Partitioning of Dataset
if(FALSE) {
  # #This approach is difficult to extend for 3 or more splits
  # #Further, Using floor() multiple times might result in loss of a row 
  # #sample() automatically expands a single number x into the sequence 1 to x
  # #This is the caution about sample(); we avoid it to prevent future bugs
  train_idx <- sample.int(n = nrow(bb), size = floor(0.8 * nrow(bb)), replace = FALSE)
  train_idx <- sample(seq_len(nrow(bb)), size = floor(0.8 * nrow(bb)), replace = FALSE)  
  train_bb <- bb[train_idx, ]
  test_bb <- bb[-train_idx, ]
}
#
# #For 3 or more splits
#brk_bb = c(train = 0.8, test = 0.1, validate = 0.1)
brk_bb = c(train = 0.9, test = 0.1)
idx_bb = sample(cut(seq_len(nrow(bb)), nrow(bb) * cumsum(c(0, brk_bb)), labels = names(brk_bb)))
#
# #Splits by Fixed Numbers not Percentages
if(FALSE) {
  brk_bb = c(train = 18570, test = nrow(bb))
  idx_bb = sample(cut(seq_len(nrow(bb)), c(0, brk_bb), labels = names(brk_bb)))
}
# #List of Multiple Tibbles
part_l = split(bb, idx_bb)
#
# #nrow(), ncol(), dim() can be applied
vapply(part_l, nrow, FUN.VALUE = integer(1))
## train  test 
## 18576  2064
stopifnot(identical(nrow(bb), sum(vapply(part_l, nrow, FUN.VALUE = integer(1)))))

sample()

# #Create Test Datasets of 10 items of Numeric and Character.
# #The first 10 letters / numbers were deliberately not chosen, to show the difference
# #between the indexing numbers /positions and the actual item values.
ii_num <- 11:20
ii_num
##  [1] 11 12 13 14 15 16 17 18 19 20
jj_chr <- letters[11:20]
jj_chr
##  [1] "k" "l" "m" "n" "o" "p" "q" "r" "s" "t"
#
set.seed(3)
# #Note: length() is used on vectors & nrow() is used on dataframes
# #Choose 60% samples out of 10 items
sample(1:length(ii_num), 0.6 * length(ii_num))
## [1] 5 7 4 2 3 8
sample(1:length(jj_chr), 0.6 * length(jj_chr))
## [1] 10  7  8  5  2  9
#
# #Note the output above is an index in both cases and always gives numbers within [1, 10]
# #Note the output below always gives numbers within [2, 10]. Index "1" will never be chosen.
sample(2:length(ii_num), 0.6 * length(ii_num))
## [1]  6  9 10  2  7  5
sample(2:length(jj_chr), 0.6 * length(jj_chr))
## [1] 9 7 3 2 5 6

44.6 Basics

Table 44.1: (C34T01)[18576, 9] Houses: Training Basics
Keys Min Max SD Mean Median Mode Unique isNA
House Value (Median) 14999 500001 115304.56 206928.9 179700 500001 3767 0
Income (Median) 0.5 15 1.9 3.87 3.53 15 11952 0
House Age (Median) 1 52 12.57 28.6 29 52 52 0
Rooms (Total) 2 39320 2191.95 2641.99 2127 1613 5733 0
Bedrooms (Total) 1 6445 425.22 539.53 435 280 1886 0
Population 3 35682 1140.33 1427.82 1166 850 3781 0
Households 1 6082 385.55 501.02 410 306 1773 0
Latitude 32.5 42 2.13 35.62 34.25 34.1 851 0
Longitude -124.3 -114 2 -119.56 -118.49 -118.3 832 0

44.7 Boxplot


Figure 44.1 (C34P01) Houses: Boxplot (Scaled)

44.8 Normality

Note that normality of the data is not strictly required to perform non-inferential PCA but that strong departures from normality may diminish the observed correlations. As data mining applications usually do not involve inference, we will not worry about normality.

44.9 Predictors SPLOM

  • Rooms, bedrooms, population, and households all appear to be positively correlated.
  • Latitude and longitude appear to be negatively correlated.
    • Scatter plot between them looks like the State of California
  • House Median Age appears to be correlated the least with the other predictors

Figure 44.2 (C34P02) Houses: SPLOM (8)

44.10 Predictors Corplot


Figure 44.3 (C34P03) Houses: Corrplot (8)

44.11 Correlation

Correlation Matrix

Table 44.2: (C34T02) Houses: Correlation Matrix
income h_age rooms bedrooms population households latitude longitude
income
h_age -0.117
rooms 0.193 -0.36
bedrooms -0.013 -0.32 0.93
population 0.001 -0.29 0.86 0.88
households 0.008 -0.3 0.92 0.98 0.9
latitude -0.078 0.01 -0.04 -0.07 -0.1 -0.07
longitude -0.017 -0.11 0.05 0.07 0.1 0.06 -0.9

Matrices

  • cor() : Correlation Function produces a Matrix
    • Matrices have a HUGE number of problems, but unfortunately the output of some functions is in that form
    • names() does NOT work on Matrices but colnames() works, even though names() is superior to colnames() in all other aspects
    • Symmetrical Matrix: the diagonal and one of the Triangles (Upper or Lower) are redundant
    • Prints too many decimal places
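
The names() vs. colnames() point can be seen directly on a tiny matrix:

```r
# #names() ignores matrix dimnames; colnames() reads them
m <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("a", "b")))
names(m)      # NULL
colnames(m)   # "a" "b"
```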
# #cor() produces a Matrix
ii <- cor(zw)
str(ii)
##  num [1:8, 1:8] 1 -0.11702 0.192888 -0.012576 0.000955 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:8] "income" "h_age" "rooms" "bedrooms" ...
##   ..$ : chr [1:8] "income" "h_age" "rooms" "bedrooms" ...
ii
##                  income       h_age       rooms    bedrooms   population   households    latitude
## income      1.000000000 -0.11701992  0.19288773 -0.01257616  0.000954879  0.008423239 -0.07844076
## h_age      -0.117019916  1.00000000 -0.35816935 -0.31683614 -0.292810501 -0.299390249  0.01077015
## rooms       0.192887732 -0.35816935  1.00000000  0.93037756  0.857256554  0.919196762 -0.03702508
## bedrooms   -0.012576164 -0.31683614  0.93037756  1.00000000  0.877848495  0.980039774 -0.06683883
## population  0.000954879 -0.29281050  0.85725655  0.87784849  1.000000000  0.907065416 -0.10872704
## households  0.008423239 -0.29939025  0.91919676  0.98003977  0.907065416  1.000000000 -0.07121960
## latitude   -0.078440760  0.01077015 -0.03702508 -0.06683883 -0.108727036 -0.071219604  1.00000000
## longitude  -0.016593996 -0.10712695  0.04526591  0.06873347  0.100075888  0.055907965 -0.92508077
##              longitude
## income     -0.01659400
## h_age      -0.10712695
## rooms       0.04526591
## bedrooms    0.06873347
## population  0.10007589
## households  0.05590797
## latitude   -0.92508077
## longitude   1.00000000
#
# #We can eliminate Lower Triangle and Diagonal. However NA does not print well with format()
# #outcome of upper.tri() is easily compared to as.table(). lower.tri() will need extra step
#
# #Take advantage of the Matrix Triangle and set to 0 for later handling by format()
# #IF we remove the diagonal, the dimensions go haywire: only [28] elements remain, which do not fill an 8-column rectangle
# #IF we keep the diagonal, the dimensions go haywire: only [36] elements remain, which do not fill an 8-column rectangle
# #So, we cannot use NA; we have to use ZERO (so that, later, format() can replace it)
#
# #However, we finally went ahead with dplyr solution which handled NA separately from format()
# #Thus eliminating need of assigning 0, NA are being used for redundant triangle and diagonal
#
kk <- ii #ii is FULL 8x8 Matrix
kk[upper.tri(kk, diag = TRUE)] <- NA 
mm  <- kk %>% as.table() %>% as_tibble(.name_repair = 'unique') %>% drop_na() %>% 
  filter(n > 0.5 | n < -0.5) %>% arrange(desc(abs(n)))
#
kk <- ii #ii is FULL 8x8 Matrix
nn  <- kk %>% as.table() %>% as_tibble(.name_repair = 'unique') %>% 
    filter(...1 != ...2) %>% 
    filter(!duplicated(paste0(pmax(as.character(...1), as.character(...2)), 
                              pmin(as.character(...1), as.character(...2))))) %>%
  filter(n > 0.5 | n < -0.5) %>% arrange(desc(abs(n)))
stopifnot(identical(mm, nn))
#
# #However, above are long because of the usage of as.table(). For Wide:
ll <- kk %>% #as.table() %>% 
  as_tibble() %>% 
  mutate(ID = row_number()) 
# #Get Names
oo <- names(ll)
ll %>% mutate(across(-ID, ~ ifelse(ID <= match(cur_column(), oo), NA, .x))) %>% select(-ID)
## # A tibble: 8 x 8
##      income   h_age   rooms bedrooms population households latitude longitude
##       <dbl>   <dbl>   <dbl>    <dbl>      <dbl>      <dbl>    <dbl> <lgl>    
## 1 NA        NA      NA       NA          NA        NA        NA     NA       
## 2 -0.117    NA      NA       NA          NA        NA        NA     NA       
## 3  0.193    -0.358  NA       NA          NA        NA        NA     NA       
## 4 -0.0126   -0.317   0.930   NA          NA        NA        NA     NA       
## 5  0.000955 -0.293   0.857    0.878      NA        NA        NA     NA       
## 6  0.00842  -0.299   0.919    0.980       0.907    NA        NA     NA       
## 7 -0.0784    0.0108 -0.0370  -0.0668     -0.109    -0.0712   NA     NA       
## 8 -0.0166   -0.107   0.0453   0.0687      0.100     0.0559   -0.925 NA

f_pKblM()

f_pKblM <- function(x, caption, isTri = TRUE, isDiag = FALSE, negPos = c(-0.0000001, 0.0000001), dig = 1L, ...) {
# #Description: 
# Prints Kable Matrix Standard Format: f_pKblM(hh, cap_hh)
# Calls: f_pKbl()
# #Arguments: 
# x: Matrix
# caption: Table Title with Table Number in "(AXXTYY)" Format
# isTri: When TRUE (Default) prints the Lower Triangle only, otherwise the complete Matrix
# isDiag: When FALSE (Default) the diagonal is blanked along with the Upper Triangle
# negPos: Vector of 2 values, to apply 3 colours to labels
# dig: Number of decimal places
# ... : Everything else is passed to f_pKbl()
#
  stopifnot(identical(length(negPos), 2L))
#
# #outcome of upper.tri() is easily compared to as.table(). lower.tri() will need extra step
  if(isTri) x[upper.tri(x, diag = !isDiag)] <- NA
#
# #Suppress Warnings because 1 column is completely NA on which mutate(across()) is applied
# #Keeping the column is better to be seen as Matrix in this specific case of Correlation Matrix
# #Warning messages: no non-missing arguments to min; returning Inf
# #Warning messages: no non-missing arguments to max; returning -Inf
#
  x <- suppressWarnings(x %>%
# #Using as.table() gives long, otherwise wide
    #as.table() %>%
    as_tibble(rownames = NA, .name_repair = 'unique') %>%
# #Value based conditional formatting needs to happen before kbl() is called because 
# #mutate() does not work on kbl
# #format() needs to be called inside cell_spec() itself 
# #format cannot be applied later because once the value is modified for kbl() it becomes numeric
# #format cannot be applied before because it changes the value to character 
    mutate(across(everything(),
                  ~ cell_spec(ifelse(is.na(.x), "",
                      format(.x, digits = dig, scientific = FALSE, drop0trailing = TRUE)),
# #Change na_font_size to 1 or higher number to see bigger visual blobs on NA
                      font_size = spec_font_size(abs(.x), na_font_size = 0),
                      color = ifelse(.x < 0 | is.na(.x), "black", "black"),
                      background = case_when(is.na(.x) ~ "black",
                                       .x < negPos[1] ~ "#D8B365",
                                       .x >= negPos[2] ~ "#5AB4AC",
                                       TRUE ~ "grey")))))
  result <- f_pKbl(x, caption = caption, ...)
  return(result)
# #xxCLOSE: f_pKblM()
}

f_pKbl()

f_pKbl <- function(x, caption, headers = names(x), debug = FALSE, maxrows = 30L) {
# #Print Kable Standard Formats: f_pKbl(hh, cap_hh, headers = names_hh, debug = TRUE)
# #Kable Prints FULL DATASET passed to it.
# #names() does NOT work on Matrices but colnames() works
# #even though names() is superior to colnames() in all other aspects
# #We can do a conditional check on the type and then call the relevant function, but for now
# #Supply colnames() manually if using Matrices
#
  if(nrow(x) > maxrows) {
  #Print only the Head of Big Datasets by checking if it has more rows than maxrows
    x <- head(x)
  }
  txt_colour  <- ifelse(debug, "black", "white")
  result <- kbl(x,
    caption = caption,
    col.names = headers,
    escape = FALSE, align = "c", booktabs = TRUE
    ) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                  html_font = "Consolas",   font_size = 12,
                  full_width = FALSE,
                  #position = "float_left",
                  fixed_thead = TRUE
    ) %>%
# #Header Row Dark & Bold: RGB (48, 48, 48) =HEX (#303030)
      row_spec(0, color = "white", background = "#303030", bold = TRUE,
               extra_css = "border-bottom: 1px solid; border-top: 1px solid"
      ) %>% row_spec(row = 1:nrow(x), color = txt_colour)
  return(result)
# #xxCLOSE: f_pKbl()
}

44.12 PCA

  • There are two general methods to perform PCA in R :
    • Spectral decomposition which examines the covariances / correlations between variables
      • princomp()
        • It uses divisor \(N\) for the covariance matrix.
      • Names: “sdev, loadings, center, scale, n.obs, scores, call”
    • Singular value decomposition (SVD) which examines the covariances / correlations between individuals
      • SVD has slightly better numerical accuracy
      • prcomp()
        • Unlike princomp, variances are computed with the usual divisor \(N - 1\).
        • cov() also uses \(N - 1\)
        • Use sdev or sqrt(Eigenvalues) to convert Eigenvectors into Loadings (For BOOK /psych Comparison)
      • Names: “sdev, rotation, center, scale, x”
  • Output contains
    • SD of Principal Components
    • rotation / loadings: the matrix of variable loadings (columns are eigenvectors)
    • x / scores: The coordinates of the individuals (observations) on the principal components.
  • Understanding the result
    • PCA was carried out on the eight predictors in the house data set. The component matrix is shown in Table 44.3.
    • Each of the columns in the Table represents one of the components \(Y_i = e_i^T\mathbf{Z}\).
    • The cell entries are called the component weights, and represent the partial correlation between the variable and the component.
      • As the component weights are correlations, they range between one and negative one.
      • NOTE: Sign may differ from the Book.
      • Eigenvalues are given by \({s}^2\) as shown in Table 44.4
      • First Eigenvalue is 3.9 and there are 8 predictor variables, thus, first component (PC1) explains \(3.9/8 \approx 48\%\) of the variance
        • i.e. this single component by itself carries about half of the information in all eight predictors.
        • In general, the first principal component may be viewed as the single best summary of the correlations among the predictors. Specifically, this particular linear combination of the variables accounts for more variability than any other linear combination.
        • The second principal component is the second-best linear combination of the variables, on the condition that it is orthogonal to the first principal component. It is derived from the variability that is left over, once the first component has been accounted for.
Definition 44.8 Two vectors are orthogonal if they are mathematically independent, have no correlation, and are at right angles to each other.
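
The divisor difference between the two methods, and the sdev-based conversion of eigenvectors into loadings, can be checked on a small synthetic matrix:

```r
# #prcomp() vs princomp(): same eigenvectors, variances differ by (N - 1) / N
set.seed(5)
m <- matrix(rnorm(100 * 4), ncol = 4)
n <- nrow(m)
pr <- prcomp(m)      # SVD; variances use divisor N - 1
pc <- princomp(m)    # spectral decomposition; variances use divisor N
all.equal(unname(pc$sdev)^2, pr$sdev^2 * (n - 1) / n)
# #Scaling each eigenvector by its sdev gives the component weights ("loadings")
w <- pr$rotation %*% diag(pr$sdev)
dim(w)   # 4 x 4, comparable to the Component Matrix table
```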

Component Matrix

# #Perform PCA by prcomp() #wwww
#ii <- princomp(zw)
pca_zw <- prcomp(zw)
#
names(pca_zw)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
#
# #Principal components have "loadings" i.e. $rotation and "scores" i.e. $x
# #Loadings specify the weight that each variable contributes to the principal component.
# #Scores show the value each sample has on each principal component.
#
dim(pca_zw$rotation)
## [1] 8 8
dim(pca_zw$x)
## [1] 18576     8
#
# #Matrix Multiplication i.e. %*% of original variables with loadings gives scores
bb <- as.matrix(zw) %*% pca_zw$rotation
all.equal(bb, pca_zw$x)
## [1] TRUE
identical(round(bb, 5), round(pca_zw$x, 5))
## [1] TRUE
#
summary(pca_zw)$importance
##                             PC1      PC2      PC3       PC4       PC5       PC6       PC7       PC8
## Standard deviation     1.975794 1.381294 1.035965 0.9078472 0.3852156 0.2846245 0.2162878 0.1211411
## Proportion of Variance 0.487970 0.238500 0.134150 0.1030200 0.0185500 0.0101300 0.0058500 0.0018300
## Cumulative Proportion  0.487970 0.726470 0.860620 0.9636400 0.9821900 0.9923200 0.9981700 1.0000000
#
pca_eigen <- summary(pca_zw)$importance %>% t() %>% as_tibble(rownames = "PCA") %>% 
  rename(SD = 2, pVar = 3, pVarCum = 4) %>% 
  mutate(EigenVal = SD^2, pVarManual = EigenVal/sum(EigenVal), 
         isOne = ifelse(EigenVal > 1, "Yes", "No"),
         isNinty = ifelse(pVarCum < 0.9, "Yes", "No"))
Table 44.3: (C34T03) Houses: PCA Component Matrix
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
Income (Median) -0.04 0.03 -0.89 0.41 0.06 -0.06 -0.17 -0.041
House Age (Median) 0.22 -0.02 0.4 0.88 -0.03 0.09 -0.04 -0.004
Rooms (Total) -0.48 -0.07 -0.09 0.11 -0.32 0.56 0.55 0.152
Bedrooms (Total) -0.49 -0.06 0.12 0.06 -0.38 -0.23 -0.22 -0.703
Population -0.47 -0.03 0.11 0.08 0.85 0.13 -0.02 -0.133
Households -0.49 -0.06 0.11 0.1 -0.14 -0.4 -0.31 0.678
Latitude 0.07 -0.7 -0.01 -0.1 -0.05 0.47 -0.52 0.035
Longitude -0.08 0.7 0.05 -0.07 -0.1 0.48 -0.5 0.048


Loadings and Eigenvectors

#wwww
# #Component Matrix of prcomp() does not match with either BOOK or psych::principal()
# #prcomp() rotation contains eigenvectors not loadings. 
# #Loadings = Eigenvectors * sqrt(Eigenvalues) = Eigenvectors * sdev
#
# #psych::principal()
psy_zw <- principal(zw, nfactors = ncol(zw), rotate = 'none', scores = TRUE)
names(psy_zw)
##  [1] "values"       "rotation"     "n.obs"        "communality"  "loadings"     "fit"         
##  [7] "fit.off"      "fn"           "Call"         "uniquenesses" "complexity"   "chi"         
## [13] "EPVAL"        "R2"           "objective"    "residual"     "rms"          "factors"     
## [19] "dof"          "null.dof"     "null.model"   "criteria"     "STATISTIC"    "PVAL"        
## [25] "weights"      "r.scores"     "Vaccounted"   "Structure"    "scores"
#
#psy_zw$loadings
#
# #To Match them Multiply by SD = sqrt(Eigenvalues)
sd_pca <- summary(pca_zw)$sdev
eigen_pca <- sd_pca ^ 2
#
# #Multiply PC1 column with sqrt(Eigenvalue) of PC1 i.e. SD and so on
load_pca <- t(t(pca_zw$rotation) * sd_pca)
#
round(load_pca, 3)
##               PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8
## income     -0.083  0.047 -0.921  0.374  0.021 -0.016 -0.036 -0.005
## h_age       0.428 -0.021  0.413  0.803 -0.013  0.026 -0.009  0.000
## rooms      -0.956 -0.103 -0.097  0.104 -0.122  0.159  0.119  0.018
## bedrooms   -0.970 -0.084  0.120  0.057 -0.145 -0.066 -0.048 -0.085
## population -0.933 -0.036  0.118  0.074  0.327  0.036 -0.004 -0.016
## households -0.972 -0.088  0.112  0.087 -0.054 -0.114 -0.066  0.082
## latitude    0.145 -0.970 -0.012 -0.089 -0.018  0.133 -0.113  0.004
## longitude  -0.150  0.969  0.057 -0.063 -0.037  0.137 -0.109  0.006
round(psy_zw$loadings, 3)
## 
## Loadings:
##            PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8   
## income                    0.921  0.374                            
## h_age      -0.428        -0.413  0.803                            
## rooms       0.956  0.103         0.104  0.122  0.159 -0.119       
## bedrooms    0.970        -0.120         0.145                     
## population  0.933        -0.118        -0.327                     
## households  0.972        -0.112               -0.114              
## latitude   -0.145  0.970                       0.133  0.113       
## longitude   0.150 -0.969                       0.137  0.109       
## 
##                  PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8
## SS loadings    3.904 1.909 1.072 0.824 0.148 0.081 0.047 0.015
## Proportion Var 0.488 0.239 0.134 0.103 0.019 0.010 0.006 0.002
## Cumulative Var 0.488 0.727 0.861 0.964 0.982 0.992 0.998 1.000

44.13 Orthogonality of PCA

Image


Figure 44.4 (C34P04) Houses: PCA Corrplot - ALL are ZERO

Code

# #Correlation Matrix | Table (Long) | Tibble
hh <- cor(pca_zw$x) %>% as.table() %>% as_tibble(.name_repair = 'unique') %>% 
  #filter(...1 != ...2) %>% 
  filter(!duplicated(paste0(pmax(as.character(...1), as.character(...2)), 
                            pmin(as.character(...1), as.character(...2))))) #%>% 
  #mutate(across(where(is.character), factor, levels = c_zsyw[-1], labels = names(c_zsyw)[-1]))
#
ttl_hh <- "Houses: PCA Corrplot - ALL are ZERO"
cap_hh <- "C34P04"

44.14 ScreePlot

Image


Figure 44.5 (C34P05 C34P06) House: PCA Screeplot with Variance

Table 44.4: (C34T04) Houses: PCA Eigenvalues & Variance
PCA SD exp_Variance cum_Var EigenVal isEigenOne isVarNinty
PC1 1.976 0.48797 0.488 3.9038 Yes Yes
PC2 1.381 0.2385 0.726 1.908 Yes Yes
PC3 1.036 0.13415 0.861 1.0732 Yes Yes
PC4 0.908 0.10302 0.964 0.8242 No No
PC5 0.385 0.01855 0.982 0.1484 No No
PC6 0.285 0.01013 0.992 0.081 No No
PC7 0.216 0.00585 0.998 0.0468 No No
PC8 0.121 0.00183 1 0.0147 No No
Total NA NA NA 8 NA NA

Code

hh <- pca_eigen
#
ttl_hh <- "Houses: PCA Eigenvalue ScreePlot"
cap_hh <- "C34P05"
y_hh <- "Eigenvalue"
# #IN: hh 
C34 <- hh %>% { ggplot(., aes(x = PCA, y = EigenVal)) + 
    geom_point(aes(color = isOne), size = 3) +
    geom_line(aes(group = 1)) +
    geom_hline(aes(yintercept = 1), color = '#440154FF', linetype = "dashed") +
    annotate("segment", x = 3.5, xend = 3.1, y = 1.6, 
                    yend = 1.2, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
    annotate("segment", x = 4.5, xend = 4.1, y = 1.3, 
                    yend = 0.9, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
    annotate("segment", x = 5.5, xend = 5.1, y = 0.6, 
                    yend = 0.2, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
    geom_text(data = tibble(x = c(3.5, 4.5, 5.5), y = c(1.7, 1.4, 0.7), 
              labels = c("Eigenvalue Criterion", "Screeplot Criterion", "Elbow Point")), 
              aes(x=x, y=y, label=labels), check_overlap = TRUE) + 
    scale_fill_distiller(palette = "BrBG") +
    #coord_fixed() +
    theme(legend.position = 'none') +
      labs(y = y_hh, subtitle = NULL, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C34)
rm(C34)
ttl_hh <- "Houses: PCA Proportion of Variance Explained"
cap_hh <- "C34P06"
y_hh <- NULL
# #IN: hh 
C34 <- hh %>% { ggplot(., aes(x = PCA, y = pVarCum)) + 
    geom_point(aes(color = isNinty), size = 3) +
    geom_line(aes(group = 1)) +
    geom_hline(aes(yintercept = 0.9), color = '#440154FF', linetype = "dashed") +
    annotate("segment", x = 4, xend = 4, y = 0.83, 
                    yend = 0.93, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
    geom_text(data = tibble(x = 5.5, y = 0.8, labels = c("Proportion of Variance Covered Criterion")), 
              aes(x=x, y=y, label=labels), check_overlap = TRUE) + 
    scale_fill_distiller(palette = "BrBG") +
    scale_y_continuous(limits = c(0, 1), labels = percent) + 
    theme(legend.position = 'none') +
    labs(y = y_hh,
         subtitle = NULL, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C34)
rm(C34)

44.15 How many Components

  • In our example, a single component (PC1) accounts for approximately half of the variability, while all 8 components together account for 100%. So, between 1 and 8, where is the cut-off?
    • Criteria
      • The Eigenvalue Criterion
      • The Proportion of Variance Explained Criterion
      • The Minimum Communality Criterion - Deferred by Book
      • The Scree Plot Criterion
  • The Eigenvalue Criterion
    • Refer Figure 44.5 and Table 44.4
    • Sum of the eigenvalues represents the number of variables entered into the PCA i.e. 8
    • An eigenvalue of 1 would then mean that the component would explain about “one variable worth” of the variability.
    • Therefore, the eigenvalue criterion states that only components with eigenvalues greater than 1 should be retained.
    • Note that, if there are fewer than 20 variables, the eigenvalue criterion tends to recommend extracting too few components, while, if there are more than 50 variables, this criterion may recommend extracting too many.
    • Thus, in our example, 3 components can be retained. PC4 has an eigenvalue of about 0.8, so it may or may not be retained.
  • The Proportion of Variance Explained Criterion
    • We can define how much of the total variability we would like the principal components to account for, and then select components accordingly
    • Thus, in our example, 3 components can be retained; PC4 is selected if more than 90% of the variability should be accounted for
  • The Scree Plot Criterion
    • Elbow Point: The maximum number of components that should be extracted is just before where the plot first begins to straighten out into a horizontal line.
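The criteria above can be sketched programmatically against any prcomp() fit; the built-in mtcars data stands in for zw here:

```r
# #Applying the selection criteria to a prcomp() fit (mtcars stands in for zw)
pc   <- prcomp(scale(mtcars))
eig  <- pc$sdev^2
pvar <- cumsum(eig)/sum(eig)
#
sum(eig > 1)          #Eigenvalue Criterion: retain components with eigenvalue > 1
which(pvar >= 0.9)[1] #Proportion of Variance Criterion: smallest k covering 90%
screeplot(pc, type = "lines") #Scree Plot Criterion: look for the elbow
```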

44.16 Factor Scores

  • Modified Table 44.3 as 44.5
  • To investigate the relationship between PC3 and PC4, and their constituent variables, we next consider factor scores. Factor scores are estimated values of the factors for each observation, and are based on factor analysis.
Table 44.5: (C34T05) Houses: PCA Eigenvectors up to PC4
PC1 PC2 PC3 PC4
Income -0.889 0.412
Age 0.216 0.399 0.885
Rooms -0.484
Beds -0.491
Pop -0.472
Houses -0.492
Lat -0.702
Long 0.701

Figure 44.6 (C34P07 C34P08) Correlation Matrices of Factor Scores of PC3 and PC4

  • Consider the left side of Figure 44.6
    • The strong negative correlation between component 3 and median income is strikingly evident
    • But the relationship between component 3 and housing median age is rather amorphous. It would be difficult to estimate the correlation between component 3 and housing median age as being 0.413, with only the scatter plot to guide us.
  • Similarly for the right side of Figure 44.6
    • The relationship between component 4 and housing median age is crystal clear
    • while the relationship between component 4 and median income is not entirely clear, reflecting its lower positive correlation of 0.374.
    • We conclude, therefore, that the component weight of 0.413 for housing median age in component 3 is not of practical significance, and similarly for the component weight for median income in component 4.
  • This discussion leads us to the following criterion for assessing the component weights.
    • For a component weight to be considered of practical significance, it should exceed \(\pm 0.5\) in magnitude.
    • Note that the component weight represents the correlation between the component and the variable; thus, the squared component weight represents the amount of the total variability of the variable that is explained by the component.
    • Thus, this threshold value of \(\pm 0.5\) requires that at least 25% of the variance of the variable be explained by a particular component.
  • Thus, the Table 44.3 is modified further as 44.6
    • NOTE: Signs differ from the Book, but that is OK.
    • NOTE: Values should match once the eigenvectors are converted to loadings
      • Note that the partition of the variables among the four components is mutually exclusive, meaning that no variable is shared (after suppression) by any two components
      • and exhaustive, meaning that all eight variables are contained in the four components.
Table 44.6: (C34T06) Houses: PCA Loadings (not Eigenvectors) up to PC4
PC1 PC2 PC3 PC4
Income -0.921
Age 0.803
Rooms -0.956
Beds -0.97
Pop -0.933
Houses -0.972
Lat -0.97
Long 0.969
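The suppression behind Table 44.6 can be sketched directly, assuming load_pca (the loadings matrix computed above) is available:

```r
# #Blank out loadings below the 0.5 threshold, keeping PC1 to PC4
supp <- round(load_pca[, 1:4], 3)
supp[abs(supp) < 0.5] <- NA  #suppress weights of no practical significance
print(supp, na.print = "")   #blank cells where the weight was suppressed
```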

44.17 Matrix

Create

# #Create a Matrix
# #(Default) The Matrix is Filled Column by Column
matrix(1:9, nrow = 3)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
#
# #Change Behaviour to fill the Matrix by Row
matrix(1:9, nrow = 3, byrow = TRUE)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

Multiply Vector

# #names() does not work on Matrix
mm <- matrix(1:9, ncol = 3, byrow = TRUE)
rownames(mm) <- tail(letters, 3)
colnames(mm) <- head(letters, 3)
mm
##   a b c
## x 1 2 3
## y 4 5 6
## z 7 8 9
#
vv <- 1:3
#
# #Matrix Multiply (Deprecated)
# #Multiply the Vector with Matrix Rows i.e. x * 1, y * 2, z * 3
ii <- diag(vv) %*% mm #loss of rownames because it is taken from Left Side Matrix
#
# #Multiply the Vector with Matrix Columns i.e. a * 1, b * 2, c * 3
jj <- mm %*% diag(vv) #loss of colnames because it is taken from Right Side Matrix
#
# #Check Attributes
attributes(ii)$dimnames
## [[1]]
## NULL
## 
## [[2]]
## [1] "a" "b" "c"
attributes(jj)$dimnames
## [[1]]
## [1] "x" "y" "z"
## 
## [[2]]
## NULL
#
# #Add Missing RowNames or ColNames 
rownames(ii) <- rownames(mm)
colnames(jj) <- colnames(mm)
#
# #Coercing by as.integer() will produce a vector, not a matrix. Use either mode() or class()
#ii[] <- as.integer(ii) #Even using [] is NOT coercing to integer
class(ii) <- "integer"
mode(jj) <- "integer" 
#
# #Print
ii
##    a  b  c
## x  1  2  3
## y  8 10 12
## z 21 24 27
str(ii)
##  int [1:3, 1:3] 1 8 21 2 10 24 3 12 27
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3] "x" "y" "z"
##   ..$ : chr [1:3] "a" "b" "c"
jj
##   a  b  c
## x 1  4  9
## y 4 10 18
## z 7 16 27
str(jj)
##  int [1:3, 1:3] 1 4 7 4 10 16 9 18 27
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3] "x" "y" "z"
##   ..$ : chr [1:3] "a" "b" "c"
#
# #Equivalent: sweep() (Deprecated). SLOW; however, it keeps the dimnames.
swp_ii <- sweep(mm, MARGIN = 1, vv, `*`)
swp_jj <- sweep(mm, MARGIN = 2, vv, `*`)
#
# #Recommended:
# #Equivalent: R Recycle Vector Column-wise. So double-transpose is needed if multiplying on jj.
# #Double Transpose is FASTEST & retains dimnames. Bonus: This is Commutative.
rec_ii <- mm * vv
com_ii <- vv * mm 
rec_jj <- t(t(mm) * vv)
com_jj <- t(vv * t(mm))
all(identical(rec_ii, com_ii), identical(rec_jj, com_jj)) #Commutative
## [1] TRUE
#
all(identical(ii, swp_ii), identical(jj, swp_jj), identical(ii, rec_ii), identical(jj, rec_jj))
## [1] TRUE

Square Elements

mm <- matrix(1:9, ncol = 3, byrow = TRUE)
rownames(mm) <- tail(letters, 3)
colnames(mm) <- head(letters, 3)
mm
##   a b c
## x 1 2 3
## y 4 5 6
## z 7 8 9
#
# #Square Each Element of the Matrix
mm ** 2 # #The ** operator is highly obscure and is actually parsed as ^, so use ^ instead
##    a  b  c
## x  1  4  9
## y 16 25 36
## z 49 64 81
mm ^ 2
##    a  b  c
## x  1  4  9
## y 16 25 36
## z 49 64 81
stopifnot(identical(mm ^ 2, mm ** 2))

44.18 Communalities

Definition 44.9 PCA does not extract all the variance from the variables, but only that proportion of the variance that is shared by several variables. Communality represents the proportion of variance of a particular variable that is shared with other variables. Communality values are calculated as the sum of squared component weights, for a given variable.
  • The communalities represent the overall importance of each of the variables in the PCA as a whole.
  • Communalities that are very low for a particular variable should be an indication that the particular variable might not participate in the PCA solution (i.e., might not be a member of any of the principal components).
  • Communalities less than 0.5 can be considered to be too low, as this would mean that the variable shares less than half of its variability in common with the other variables.
    • Now, if we want to keep the variable housing median age as an active part of the analysis, then, extracting only three components would not be adequate, as housing median age shares only 35% of its variance with the other variables. To keep this variable in the analysis, we would need to extract the fourth component, which lifts the communality for housing median age over the 50% threshold.
  • Minimum Communality Criterion
    • Enough components should be extracted so that the communalities for each of these variables exceed a certain threshold (e.g., 50%)
Table 44.7: (C34T07) Houses: PCA Loadings with Communalities
PC1 PC2 PC3 PC4 Comm_PC3 Comm_PC4
Income -0.083 0.047 -0.921 0.374 0.86 1
Age 0.428 -0.021 0.413 0.803 0.35 1
Rooms -0.956 -0.103 -0.097 0.104 0.93 0.95
Beds -0.97 -0.084 0.12 0.057 0.96 0.97
Pop -0.933 -0.036 0.118 0.074 0.89 0.89
Houses -0.972 -0.088 0.112 0.087 0.97 0.97
Lat 0.145 -0.97 -0.012 -0.089 0.96 0.97
Long -0.15 0.969 0.057 -0.063 0.96 0.97
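The communality columns of Table 44.7 are just row sums of squared loadings. A sketch, assuming load_pca (the loadings matrix) from above:

```r
# #Communality = sum of squared loadings across the retained components
comm <- cbind(Comm_PC3 = rowSums(load_pca[, 1:3]^2),
              Comm_PC4 = rowSums(load_pca[, 1:4]^2))
round(comm, 2)
# #Variables with Comm_PC3 < 0.5 (here h_age) need the fourth component
```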

44.19 Decision on How many Components

  • The Eigenvalue Criterion recommended 3 components, but did not absolutely reject the 4 component. Also, for small numbers of variables, this criterion can underestimate the best number of components to extract.

  • The Proportion of Variance Explained Criterion stated that we needed to use 4 components if we wanted to account for >90% of the variability.

  • The Scree Plot Criterion said not to exceed 4 components.

  • The Minimum Communality Criterion stated that, if we wanted to keep housing median age in the analysis, we had to extract 4 components.

  • Conclusion: PC4 is included.

  • Test Dataset

    • We can perform PCA on the Test dataset as well; it should show a similar pattern for PC1 to PC4 in terms of eigenvectors and loadings (not exactly the same, but similar). This should be taken as confirmation that the PCA fitted on the Training data can be applied to the Test data
    • If the results differ too much, that should be taken as an indication that the training data is not representative of the test data
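A sketch of this check, with hypothetical zw_train / zw_test splits of the standardized data; predict.prcomp() centers the new data using the training parameters and applies the training rotation:

```r
# #Score the Test data with the Training PCA (zw_train / zw_test are hypothetical)
pca_train   <- prcomp(zw_train)
test_scores <- predict(pca_train, newdata = zw_test) #uses the TRAINING rotation
#
# #A separate PCA on the Test data should show a similar pattern (up to sign)
pca_test <- prcomp(zw_test)
round(cbind(Train = pca_train$rotation[, 1], Test = pca_test$rotation[, 1]), 2)
```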

44.20 Factor Analysis

Factor Analysis (FA) is related to PCA but the two methods have different goals.

Principal components seek to identify orthogonal linear combinations of the variables, to be used either for descriptive purposes or to substitute a smaller number of uncorrelated components for the original variables.

In contrast, factor analysis represents a model for the data, and as such is more elaborate.

The factor analysis model hypothesizes that the response vector \({\{X_1, X_2, \ldots, X_m\}}\) can be modeled as linear combinations of a smaller set of \({k}\) unobserved, “latent” random variables \({\{F_1, F_2, \ldots, F_k\}}\) called common factors, along with an error term \(\mathbf{\epsilon} = {\{\epsilon_1, \epsilon_2, \ldots, \epsilon_m\}}\). Specifically, the factor analysis model is :

\[\underset{m \times 1}{\mathbf{X - \mu}} = \underset{m \times k}{\mathbf{L}} \, \underset{k \times 1}{\mathbf{F}} + \underset{m \times 1}{\mathbf{\epsilon}} \tag{44.1}\]

Where \(\underset{m \times 1}{\mathbf{X - \mu}}\) is the response vector, centered by the mean vector, \(\underset{m \times k}{\mathbf{L}}\) is the matrix of factor loadings, with \(l_{ij}\) representing the factor loading of the \(i^{\text{th}}\) variable on the \(j^{\text{th}}\) factor, \(\underset{k \times 1}{\mathbf{F}}\) represents the vector of unobservable common factors, and \(\underset{m \times 1}{\mathbf{\epsilon}}\) represents the error vector.

The factor analysis model differs from other models, such as the linear regression model, in that the predictor variables \({\{F_1, F_2, \ldots, F_k\}}\) are unobservable. Because so many terms are unobserved, further assumptions must be made before we may uncover the factors from the observed responses alone.

These assumptions are that \(E(\mathbf{F}) = \mathbf{0}, \text{Cov}(\mathbf{F}) = \mathbf{I}, E(\mathbf{\epsilon}) = \mathbf{0}, \text{Cov}(\mathbf{\epsilon})\) is a diagonal matrix.
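Taking covariances on both sides of Equation (44.1), with the standard additional assumption that \(\mathbf{F}\) and \(\mathbf{\epsilon}\) are uncorrelated, yields the implied covariance structure that the fitting procedure actually works with:

\[\text{Cov}(\mathbf{X}) = \mathbf{L} \, \text{Cov}(\mathbf{F}) \, \mathbf{L}^{T} + \text{Cov}(\mathbf{\epsilon}) = \mathbf{L}\mathbf{L}^{T} + \text{Cov}(\mathbf{\epsilon})\]

so that the variance of each \(X_i\) splits into a communality \(\sum_{j=1}^{k} l_{ij}^2\) (shared with the factors) plus a uniqueness (the \(i^{\text{th}}\) diagonal entry of \(\text{Cov}(\mathbf{\epsilon})\)).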

Unfortunately, the factor solutions provided by factor analysis are invariant to transformations. Hence, the factors uncovered by the model are in essence nonunique, without further constraints. This indistinctness provides the motivation for factor rotation.

44.21 UCI Data Repository

44.22 Data Adult

Please import the "C34-adult.csv"

  • Source: https://archive-beta.ics.uci.edu/ml/datasets/adult
  • About: Train [32561, 15] & Test[16281, 15] = Total[48842, 15]
    • The intended task is to find the set of demographic characteristics that can best predict whether or not the individual has an income of over 50000 dollars per year.
  • Steps (External) Luke Perich
    • Merged Train and Test for easy and simultaneous cleaning. Source Column attached for easy separation later.
    • ‘fnlwgt’ (stands for final weight) - Dropped
      • It has no predictive power since it is a feature aimed to allocate similar weights to people with similar demographic characteristics.
    • ‘income’
      • Edited “.” in Text and Modified to 0 (<= 50K) and 1 (>50K) as Factor for Summary
    • ‘Education’ is dropped
      • It is just a label on ‘education_num’ (number of years of education). (Not Tested “ForLater”)
    • ‘marital_status’
      • Number of Levels reduced by merging 3 types of Married. Two of them have small count.
    • ‘age’ - num - Min 17 to Max 90 All Numbers are Present
    • ‘education_num’ - num - Min 1 to Max 16 All Numbers are Present
    • ‘hours_per_week’ - num -
      • 3 Hours are missing but that can happen
      • cannot remove 99 hours because 98 hours and other nearby hours are also present in the data
    • ‘capital_gain’ & ‘capital_loss’ - Both removed
      • After 41310 dollars, the next value jumps directly to 99999 dollars, which cannot be correct. It should be set to the Median if the column is kept
      • 44807 observations have 0 capital gain
      • 46560 observations have 0 capital loss
    • ‘workclass,’ ‘occupation,’ ‘native_country’
      • These 3 contain Question Mark which have been converted to NA but not removed from dataset to have the possibility of imputation or to keep their other column variable information for analysis
    • ‘native_country’
      • With a huge bias towards the US, there is no point in having so many countries, or even regions.
      • Changed to Binary Factors of USA and Other

Import

Processing

# #Merge Tibbles with ID Names in Column
# #NA Introduced by changing Question Mark to NA
aa <- bind_rows(Train = tbl_aa, Test = tbl_bb, .id = 'source') 
#
bb <- aa %>% 
  select(-c(fnlwgt, education, capital_gain, capital_loss)) %>% 
  mutate(Income = ifelse(Income == "<=50K" | Income == "<=50K.", "0", "1")) %>% 
  mutate(across(c(workclass, occupation, native_country), ~na_if(., "?"))) %>% 
  mutate(native_country = ifelse(str_detect(native_country, 
            paste0(c("United-", "Outlying-US"), collapse = "|")), "USA", "Other")) %>% 
  mutate(across(where(is.character), ~ factor(., levels = unique(.)))) %>% 
  mutate(marital_status = 
    fct_collapse(marital_status, 
      Married = c("Married-civ-spouse", "Married-spouse-absent", "Married-AF-spouse"))) 
#
xxC34Adult <- bb
f_setRDS(xxC34Adult)

Merge Factor Levels

# #Levels of each Factor Variable
#lapply(aa[ , sapply(aa, is.factor)], levels)
#levels(aa$marital_status)
summary(aa$marital_status)
ii <- aa %>% select(marital_status) %>% 
  mutate(marital_status = 
    fct_collapse(marital_status, 
      Married = c("Married-civ-spouse", "Married-spouse-absent", "Married-AF-spouse")))
#summary(ii$marital_status)

Check Numeric

# #Check Numeric Columns by summary()
if(TRUE) aa %>% select(!where(is.numeric)) %>% summary()
if(TRUE) sort(unique(aa$age))
if(TRUE) length(sort(unique(aa$age)))
if(TRUE) aa %>% count(age) %>% mutate(PROP = n/sum(n)) #%>% arrange(desc(n)) %>% head(10)
#
# #Find Missing Numbers in a Sequence of Numbers
#ii <- unique(aa$age) 
ii <- unique(aa$hours_per_week)
jj <- min(ii):max(ii)
jj[!jj %in% ii]
#
# #Equivalent
setdiff(jj, ii)

Search String

# #To Search For Question Mark in All Factor Columns, Question Mark needs to be escaped
# #The Backslash used for escaping itself needs to be escaped using Backslash
aa %>% rowwise() %>%
  mutate(find_me = any(str_detect(c_across(where(is.factor)), 
                                  regex("\\?", ignore_case = TRUE)), na.rm = TRUE)) %>% 
  filter(find_me)
#
# #To Get the Column Names containing a String i.e. '?'
which(vapply(aa, function(x) any(stri_detect(x, regex = "\\?", max_count = 1)), logical(1)))

Replace Multiple Partial Matches

# #Search and Replace for Multiple Partial Matches 
# #NOTE: "|" should be used to collapse, NOT " | "
# #NOTE: Question Marks Replaced as Other
aa %>% mutate(native_country = ifelse(str_detect(native_country, 
            paste0(c("United-", "Outlying-US"), collapse = "|")), "USA", "Other")) %>% 
  count(native_country)

Check Factor

# #Check Factor Columns by summary()
if(TRUE) aa %>% select(!where(is.factor)) %>% summary()
ii <- factor(aa$native_country)
if(TRUE) levels(ii)
if(TRUE) nlevels(ii)
aa %>% count(native_country) #%>% tail(10) 

44.23 Correlation Matrix

Note that the correlations, although statistically significant in several cases, are overall much weaker than the correlations from the ‘houses’ data set. A weaker correlation structure should pose more of a challenge for the dimension-reduction method.

NOTE: While the Book created ‘Net Capital,’ I have skipped that because the Capital Gain and Capital Loss columns have an extremely high number of zeroes. Further, the ‘fnlwgt’ column was also dropped.

Table 44.8: (C34T08) Adult: Correlation Matrix
age edu hours
age
edu 0.0363
hours 0.0699 0.15
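A sketch reproducing Table 44.8, assuming adl_zw holds the standardized age, edu and hours columns used in the KMO and Bartlett tests below:

```r
# #Correlation Matrix of the three standardized Adult predictors
round(cor(adl_zw), 4)
```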

Factor analysis requires a certain level of correlation in order to function appropriately. The following tests have been developed to ascertain whether there exists sufficiently high correlation to perform factor analysis.

  • Note, however, that statistical tests in the context of huge data sets can be misleading. With huge sample sizes, even the smallest effect sizes become statistically significant. This is why data mining methods rely on cross-validation methodologies, not statistical inference.

  • The proportion of variability within the standardized predictor variables which is shared in common, and therefore might be caused by underlying factors, is measured by the Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy.

    • Values of the KMO statistic less than 0.50 indicate that factor analysis may not be appropriate.
  • Bartlett Test of Sphericity tests the null hypothesis that the correlation matrix is an identity matrix, that is, that the variables are really uncorrelated.

  • The statistic reported is the p-value, so that very small values would indicate evidence against the null hypothesis, that is, the variables really are correlated.

  • For p-values much larger than 0.10, there is insufficient evidence that the variables are correlated, and so factor analysis may not be suitable.

  • It compares an observed correlation matrix to the identity matrix.

    • Essentially it checks to see if there is a certain redundancy between the variables that we can summarize with a small number of factors.
    • The null hypothesis of the test is that the variables are orthogonal, i.e. not correlated.
    • If the correlation matrix were an identity matrix, each variable would be perfectly orthogonal (i.e. “uncorrelated”) to every other variable, and thus a data reduction technique like PCA or factor analysis would not be able to “compress” the data in any meaningful way.
  • The KMO statistic has a value of 0.52, which is not less than 0.5, meaning that this test does not find the level of correlation to be too low for factor analysis.

  • The p-value for Bartlett Test of Sphericity rounds to zero, so that the null hypothesis that no correlation exists among the variables is rejected. We therefore proceed with the factor analysis.

44.24 KMO Test

# #KMO Test: Measure of Sampling Adequacy (MSA) 
KMO(cor(adl_zw))
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(adl_zw))
## Overall MSA =  0.52
## MSA for each item = 
##   age   edu hours 
##  0.56  0.51  0.51

44.25 Bartlett Test of Sphericity

  • Caution: It is NOT same as ‘Bartlett Test for Equality of Variances’
  • This test requires multivariate normality. If this condition is not met, KMO can still be used.
# #Bartlett Test of Sphericity
bartsph <- cortest.bartlett(cor(adl_zw), n = nrow(adl_zw))
if(bartsph$p.value < 0.05) {
  cat("Null Rejected. Variables are Correlated. Dimension Reduction can be performed.\n")
} else {
  cat("Failed to reject the Null. Uncorrelated Variables. No benefit in Dimension Reduction.\n")
}
## Null Rejected. Variables are Correlated. Dimension Reduction can be performed.

44.26 FA

  • To allow us to view the results using a scatter plot, we decide a priori to extract only two factors.

  • The following factor analysis is performed using the principal axis factoring option.

    • In principal axis factoring, an iterative procedure is used to estimate the communalities and the factor solution.
    • This particular analysis required 152 such iterations before reaching convergence.
    • The eigenvalues and the proportions of the variance explained by each factor are shown in Table
  • fa()

    • Trying with ‘pa’ as given in Book, even though ‘pa’ produces warnings whereas ‘minres’ does not
  • Warning: The estimated weights for the factor scores are probably incorrect. Try a different factor score estimation method.

    • It most probably arises because age and edu might be serially correlated, i.e.
      • As Age increases, Education might also increase, with a lag; Age values might be shifted forward in time relative to Education values
  • Warning: An ultra-Heywood case was detected. Examine the results carefully

    • communality > 1
    • Heywood cases should be treated as invalid.
    • Try to reduce the number of factors, try other initial communalities (in PAF method), try to drop variables with low KMO, check multicollinearity
    • These are encountered typically when there are too few variables to support the requested number of factors.
      • Both Warnings go away when the requested number of factors is reduced from 3 to 2
ERROR 44.1 Error in if (prod(R2) < 0) : missing value where TRUE/FALSE needed
  • Add ‘SMC = FALSE’ in fa()

  • NOTE

    • Cumulative Variance is only 48%, i.e. less than half of the variability is explained by the first two factors extracted.
      • In contrast, the Housing data had ~76% explained by the first two components, because the correlation there was strong

pa

adl_fa <- fa(adl_zw, nfactors = 2, #ncol(adl_zw)
   fm = "pa", rotate = "none", SMC = FALSE)
# #Loadings
adl_fa$loadings
## 
## Loadings:
##       PA1    PA2   
## age    0.983       
## edu           0.633
## hours         0.225
## 
##                  PA1   PA2
## SS loadings    0.982 0.457
## Proportion Var 0.327 0.152
## Cumulative Var 0.327 0.480
#
# #Values
adl_fa$values
## [1] 0.9817285949 0.4567784008 0.0007492974
#
# #Communalities
adl_fa$communalities
## [1] 1.438507

pa

# #Warnings
# #The estimated weights for the factor scores are probably incorrect.  
# #Try a different factor score estimation method.
# #An ultra-Heywood case was detected.  Examine the results carefully
fa(adl_zw, nfactors = 2, #ncol(adl_zw)
   fm = "pa", rotate = "none", SMC = FALSE)
## Factor Analysis using method =  pa
## Call: fa(r = adl_zw, nfactors = 2, rotate = "none", SMC = FALSE, fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        PA1   PA2    h2    u2 com
## age   0.98 -0.08 0.972 0.028 1.0
## edu   0.09  0.63 0.408 0.592 1.0
## hours 0.09  0.23 0.058 0.942 1.3
## 
##                        PA1  PA2
## SS loadings           0.98 0.46
## Proportion Var        0.33 0.15
## Cumulative Var        0.33 0.48
## Proportion Explained  0.68 0.32
## Cumulative Proportion 0.68 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  3  and the objective function was  0.03 with Chi Square of  706.35
## The degrees of freedom for the model are -2  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0 
## The df corrected root mean square of the residuals is  NA 
## 
## The harmonic number of observations is  25000 with the empirical chi square  0  with prob <  NA 
## The total number of observations was  25000  with Likelihood Chi Square =  0  with prob <  NA 
## 
## Tucker Lewis Index of factoring reliability =  1.004
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    PA1   PA2
## Correlation of (regression) scores with factors   0.98  0.66
## Multiple R square of scores with factors          0.97  0.43
## Minimum correlation of possible factor scores     0.94 -0.14

With cor()

# #Warnings
# #The estimated weights for the factor scores are probably incorrect.  
# #Try a different factor score estimation method.
# #An ultra-Heywood case was detected.  Examine the results carefully
fa(cor(adl_zw), nfactors = 2, #ncol(adl_zw)
   fm = "pa", rotate = "none", n.obs = 25000, SMC = FALSE)
## Factor Analysis using method =  pa
## Call: fa(r = cor(adl_zw), nfactors = 2, n.obs = 25000, rotate = "none", 
##     SMC = FALSE, fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        PA1   PA2    h2    u2 com
## age   0.98 -0.08 0.972 0.028 1.0
## edu   0.09  0.63 0.408 0.592 1.0
## hours 0.09  0.23 0.058 0.942 1.3
## 
##                        PA1  PA2
## SS loadings           0.98 0.46
## Proportion Var        0.33 0.15
## Cumulative Var        0.33 0.48
## Proportion Explained  0.68 0.32
## Cumulative Proportion 0.68 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  3  and the objective function was  0.03 with Chi Square of  706.35
## The degrees of freedom for the model are -2  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0 
## The df corrected root mean square of the residuals is  NA 
## 
## The harmonic number of observations is  25000 with the empirical chi square  0  with prob <  NA 
## The total number of observations was  25000  with Likelihood Chi Square =  0  with prob <  NA 
## 
## Tucker Lewis Index of factoring reliability =  1.004
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    PA1   PA2
## Correlation of (regression) scores with factors   0.98  0.66
## Multiple R square of scores with factors          0.97  0.43
## Minimum correlation of possible factor scores     0.94 -0.14

minres

# #No Warning
fa(adl_zw, nfactors = ncol(adl_zw), fm = "minres", rotate = "none")
## Factor Analysis using method =  minres
## Call: fa(r = adl_zw, nfactors = ncol(adl_zw), rotate = "none", fm = "minres")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        MR1   MR2 MR3    h2   u2 com
## age   0.15  0.14   0 0.042 0.96 2.0
## edu   0.36 -0.12   0 0.144 0.86 1.2
## hours 0.43  0.05   0 0.187 0.81 1.0
## 
##                        MR1  MR2  MR3
## SS loadings           0.34 0.04 0.00
## Proportion Var        0.11 0.01 0.00
## Cumulative Var        0.11 0.12 0.12
## Proportion Explained  0.90 0.10 0.00
## Cumulative Proportion 0.90 1.00 1.00
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  3  and the objective function was  0.03 with Chi Square of  706.35
## The degrees of freedom for the model are -3  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0 
## The df corrected root mean square of the residuals is  NA 
## 
## The harmonic number of observations is  25000 with the empirical chi square  0  with prob <  NA 
## The total number of observations was  25000  with Likelihood Chi Square =  0  with prob <  NA 
## 
## Tucker Lewis Index of factoring reliability =  1.004
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                     MR1   MR2 MR3
## Correlation of (regression) scores with factors    0.54  0.20   0
## Multiple R square of scores with factors           0.29  0.04   0
## Minimum correlation of possible factor scores     -0.43 -0.92  -1

44.27 Factor Rotation

To assist in the interpretation of the factors, factor rotation may be performed. Factor rotation corresponds to a transformation (usually orthogonal) of the coordinate axes, leading to a different set of factor loadings.

The sharpest focus occurs when each variable has high factor loadings on a single factor, with low-to-moderate loadings on the other factors.

  • “ForLater” Figure 4.6 Page 115 - Biplot

  • No significant difference was observed with different Rotations, unlike in the book.

# #Huge number of Rotations are available including "none", "varimax", "quartimax", "equamax" ...
# #No significant difference observed
fa(adl_zw, nfactors = 2, fm = "pa", rotate = "none", SMC = FALSE)$loadings
## 
## Loadings:
##       PA1    PA2   
## age    0.983       
## edu           0.633
## hours         0.225
## 
##                  PA1   PA2
## SS loadings    0.982 0.457
## Proportion Var 0.327 0.152
## Cumulative Var 0.327 0.480
#
fa(adl_zw, nfactors = 2, fm = "pa", rotate = "varimax", SMC = FALSE)$loadings
## 
## Loadings:
##       PA1    PA2   
## age    0.982       
## edu           0.638
## hours         0.236
## 
##                  PA1   PA2
## SS loadings    0.968 0.471
## Proportion Var 0.323 0.157
## Cumulative Var 0.323 0.480
#
fa(adl_zw, nfactors = 2, fm = "pa", rotate = "quartimax", SMC = FALSE)$loadings
## 
## Loadings:
##       PA1   PA2  
## age   0.986      
## edu         0.638
## hours       0.232
## 
##                  PA1   PA2
## SS loadings    0.978 0.461
## Proportion Var 0.326 0.154
## Cumulative Var 0.326 0.480
#
fa(adl_zw, nfactors = 2, fm = "pa", rotate = "equamax", SMC = FALSE)$loadings
## 
## Loadings:
##       PA1   PA2  
## age   0.986      
## edu         0.638
## hours       0.232
## 
##                  PA1   PA2
## SS loadings    0.977 0.461
## Proportion Var 0.326 0.154
## Cumulative Var 0.326 0.480

44.28 User Defined Composites

  • User Defined Composites or Summated Scales
    • A user-defined composite is simply a linear combination of the variables, which combines several variables together into a single composite measure.
    • The simplest user-defined composite is simply the mean of the variables.
    • When compared to the use of individual variables, user-defined composites provide a way to diminish the effect of measurement error.
      • Measurement error refers to the disparity between the observed variable values, and the “true” variable value. Measurement error contributes to the background error noise, interfering with the ability of models to accurately process the signal provided by the data, with the result that truly significant relationships may be missed.
      • User-defined composites reduce measurement error by combining multiple variables into a single measure.
    • Appropriately constructed user-defined composites allow the analyst to represent the manifold aspects of a particular concept using a single measure.
      • Thus, user-defined composites enable the analyst to embrace the range of model characteristics, while retaining the benefits of a parsimonious model.

Validation


45 Model Data

45.1 Overview

“Univariate Statistical Analysis (335)” was a summary view of Hypothesis Testing.

“Multivariate Statistics (336)” was a summary view of ANOVA, Goodness of Fit etc.

“Simple Linear Regression (338)” has been merged in Anderson C14.

“Multiple Regression and Model Building (339)” has been merged in Anderson C15.

  • “Preparing to Model the Data”

45.2 Data Mining

  • Data Mining Methods and Definitions
    • Data mining methods may be categorized as either supervised or unsupervised.
    • Most data mining methods are supervised methods.
    • Unsupervised : Clustering, PCA, Factor Analysis, Association Rules, RFM
    • Supervised :
      • Regression (Continuous Target) : Linear Regression, Regularised Regression, Decision trees, Ensemble learning
        • Linear Regression : Ridge, Lasso and Elastic Regression
        • Ensemble learning : Bagging, Boosting (AdaBoost, XGBoost), Random forests
      • Classification (Categorical Target) : Decision trees, Ensemble learning, Logistic Regression, k-nearest neighbor (k-NN), Naive-Bayes
      • Deep Learning : Neural Networks
Definition 45.1 In unsupervised methods, no target variable is identified as such. Instead, the data mining algorithm searches for patterns and structures among all the variables. The most common unsupervised data mining method is clustering. Ex: Voter Profile.
Definition 45.2 Supervised methods are those in which there is a particular prespecified target variable and the algorithm is given many examples where the value of the target variable is provided. This allows the algorithm to learn which values of the target variable are associated with which values of the predictor variables.

45.3 Statistical Inference vs. Data Mining

  • Statistical methodology and data mining methodology differ in the following two ways:
    • Applying statistical inference using the huge sample sizes encountered in data mining tends to result in statistical significance, even when the results are not of practical significance.
    • In statistical methodology, the data analyst has an a priori hypothesis in mind. Data mining procedures usually do not have an a priori hypothesis.
Definition 45.3 An a priori hypothesis is one that is generated prior to a research study taking place.

45.4 Cross-validation

Definition 45.4 Cross-validation is a technique for ensuring that the results uncovered in an analysis are generalizable to an independent, unseen data set.
  • In data mining, the most common methods are twofold cross-validation and k-fold cross-validation.
    • In twofold cross-validation, the data are partitioned, using random assignment, into a training data set and a test data set. The test data set should then have the target variable omitted. Thus, the only systematic difference between the training data set and the test data set is that the training data includes the target variable and the test data does not.
    • A provisional data mining model is then constructed using the training samples provided in the training data set.
    • However, the algorithm needs to guard against “memorizing” the training set and blindly applying all patterns found in the training set to future data. Ex: Just because all people named ‘David’ in the training set are in the high income bracket, it may not be true for people in general.
    • Therefore, the next step is to examine how the provisional model performs on a test set of data. In the test set the provisional model performs classification according to the patterns and structures it learned from the training set.
    • The efficacy of the classifications is then evaluated by comparing them against the true values of the target variable.
    • The provisional model is then adjusted to minimize the error rate on the test set.
  • We must ensure that the training and test data sets are independent, by validating the partition.
    • By performing graphical and statistical comparisons between the two sets.
    • For example, we may find that, even though the assignment of records was made randomly, a significantly higher proportion of positive values of an important flag variable were assigned to the training set, compared to the test set. This would bias our results.
    • It is especially important that the characteristics of the target variable be as similar as possible between the training and test data sets.
    • Hypothesis tests for validating the target variable, based on the type of target variable: t-test (for difference in means), z-test (for difference in proportions), test for homogeneity of proportions
  • Cross-validation guards against spurious results, as it is highly unlikely that the same random variation would be found to be significant in both the training set and the test set.
  • In k-fold cross validation, the original data is partitioned into k independent and similar subsets.
    • The model is then built using the data from k−1 subsets, using the \({k}^{\text{th}}\) subset as the test set.
    • This is done iteratively until we have k different models. The results from the k models are then combined using averaging or voting.
    • A popular choice for k is 10.
    • A benefit of using k-fold cross-validation is that each record appears in the test set exactly once; a drawback is that the requisite validation task is made more difficult.
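The k-fold procedure above can be sketched in base R. The use of the built-in mtcars data and the lm() formula below are purely illustrative stand-ins for the actual training data and model:

```r
# #A minimal sketch of k-fold cross-validation (mtcars and lm() are
# #illustrative stand-ins for the real data and model)
set.seed(123)
k <- 10
# #Randomly assign each record to one of k independent, similar subsets
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
cv_mse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]            # model built on k-1 subsets
  test  <- mtcars[folds == i, ]            # the kth subset is the test set
  fit <- lm(mpg ~ wt + hp, data = train)
  mean((test$mpg - predict(fit, test))^2)  # error on the held-out fold
})
mean(cv_mse)  # results from the k models combined by averaging
```

Each record appears in the test set exactly once, as noted above.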

45.5 Overfitting

44.3 Overfitting is the production of an analysis that corresponds too closely to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.

  • Increasing the complexity of the model in order to increase the accuracy on the training set eventually and inevitably leads to a degradation in the generalizability of the provisional model to the test set.
    • As the model complexity increases, the error rate on the training set continues to fall in a monotone manner.
    • However, as the model complexity increases, the test set error rate soon begins to flatten out and increase because the provisional model has memorized the training set rather than leaving room for generalizing to unseen data.

45.6 Bias-Variance Trade-Off

  • The low complexity model suffers from some classification errors. The classification errors can be reduced by a more complex model.
    • We might be tempted to adopt the greater complexity in order to reduce the error rate.
    • However, we should be careful not to depend on the idiosyncrasies of the training set.
    • The low-complexity model need not change very much to accommodate new data points. i.e. low-complexity model has low variance.
    • However, the high-complexity model must alter considerably if it is to maintain its low error rate. i.e. high-complexity model has a high variance.
Definition 45.5 Even though the high-complexity model has low bias (error rate), it has a high variance; and even though the low-complexity model has a high bias, it has a low variance. This is known as the bias-variance trade-off. It is another way of describing the overfitting-underfitting dilemma.
  • The goal is to construct a model in which neither the bias nor the variance is too high
    • A common method of evaluating how accurate model estimation is proceeding for a continuous target variable is to use the mean-squared error (MSE). (Target: Low MSE)
      • MSE is a good evaluative measure because it combines both bias and variance. i.e. \(\text{MSE} = \text{variance} + \text{bias}^2\)

36.20 Mean-Squared error (MSE) is an evaluating measure of accuracy of model estimation for a continuous target variable. It provides the estimate of \({\sigma}^2\). It is given by SSE divided by its degrees of freedom \((n - 2)\). i.e. \(s^2 = \text{MSE} = \frac{\text{SSE}}{n - 2}\). Where ‘s’ is the standard error of the estimate. Lower MSE is preferred.
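The decomposition MSE = variance + bias² can be checked with a small simulation. The shrunken-mean estimator, the true value mu, and the sample size below are arbitrary illustrative choices:

```r
# #A small simulation illustrating MSE = variance + bias^2
# #(the shrunken estimator and constants are arbitrary illustrative choices)
set.seed(42)
mu  <- 5                                                   # true value
est <- replicate(10000, 0.9 * mean(rnorm(20, mean = mu)))  # a biased estimator
mse      <- mean((est - mu)^2)
bias     <- mean(est) - mu
variance <- var(est)
c(MSE = mse, `variance + bias^2` = variance + bias^2)
# #The two quantities agree up to simulation error
```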

45.7 Balancing the Training Dataset

  • For classification models, in which one of the target variable classes has much lower relative frequency than the other classes, balancing is recommended.
    • I guess the Adult dataset could be a suitable candidate for this, because there is a 75:25 ratio between the two levels of the target variable (income)
    • A benefit of balancing the data is to provide the classification algorithms with a rich balance of records for each classification outcome, so that the algorithms have a chance to learn about all types of records, not just those with high target frequency.
    • For example, suppose we are running a fraud classification model and our training data set consists of 100000 transactions, of which only 1000 are fraudulent. Then, our classification model could simply predict “non-fraudulent” for all transactions, and achieve 99% classification accuracy. However, clearly this model is useless. Instead, the analyst should balance the training data set so that the relative frequency of fraudulent transactions is increased.
  • There are two ways to accomplish this, which are as follows:
    • Resample a number of fraudulent (rare) records - Discouraged
    • Set aside a number of non-fraudulent (non-rare) records
Definition 45.6 Resampling refers to the process of sampling at random and with replacement from a data set. It is discouraged.
  • Suppose we wished our 1000 fraudulent records to represent 25% of the balanced training set, rather than the 1% represented by these records in the raw training data set.
    • \(x = \frac{p \times \text{records} - \text{rare}}{1 - p}\)
    • where \({x}\) is the required number of resampled records, \({p}\) is the desired proportion of rare values in the balanced data set, ‘records’ is the number of records in the unbalanced data set, and ‘rare’ is the current number of rare target values
    • Thus \(x = \frac{0.25 \times 100000 - 1000}{1 - 0.25} = 32000\) more records can be added to achieve a 25% proportion of fraudulent records in the balanced set.
    • Caution: Some people discourage this practice because they feel this amounts to fabricating data.
  • Alternatively, a sufficient number of non-fraudulent transactions would instead be set aside, thereby increasing the proportion of fraudulent transactions.
    • To achieve a 25% balance proportion, we would retain only 3000 non-fraudulent records. i.e. discard 96000 of the 99000 non-fraudulent records from the analysis, using random selection.
    • Caution: Data mining models might suffer as a result of starving them of data in this way.
    • Thus, it is advised to decrease the desired balance proportion to something like 10%.
  • The test data set should never be balanced.
    • The test data set represents new data that the models have not seen yet.
    • Note that all model evaluation will take place using the test data set, so that the evaluative measures will all be applied to unbalanced (real-world-like) data.
  • Direct overall comparisons between the original and balanced data sets are futile, as changes in character are inevitable.
    • Because some predictor variables have higher correlation with the target variable than do other predictor variables, the character of the balanced data will change.
    • For example, suppose we are working with the Churn data set, and suppose that churners have higher levels of ‘day minutes’ than non-churners. Then, when we balance the data set, the overall mean of ‘day minutes’ will increase, as we have eliminated so many non-churner records. Such changes cannot be avoided when balancing data sets.
    • However, apart from these unavoidable changes, and although the random sampling tends to protect against systematic deviations, data analysts should provide evidence that their balanced data sets do not otherwise differ systematically from the original data set.
    • This can be accomplished by examining the graphics and summary statistics from the original and balanced data set, partitioned on the categories of the target variable.
    • Hypothesis tests may be applied.
    • If deviations are uncovered, the balancing should be reapplied.
    • Cross-validation measures can be applied if the analyst is concerned about these deviations.
      • Multiple randomly selected balanced data sets can be formed, and the results averaged, for example.
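The arithmetic of the fraud example above can be worked through directly:

```r
# #Worked computation of the balancing example above
records <- 100000   # transactions in the unbalanced training set
rare    <- 1000     # fraudulent (rare) records
p       <- 0.25     # desired proportion of rare records

# #Option 1: resample rare records (discouraged)
x <- (p * records - rare) / (1 - p)
x                          # 32000 additional fraudulent records
(rare + x) / (records + x) # check: equals p

# #Option 2: set aside non-rare records
retain_nonrare <- rare * (1 - p) / p
retain_nonrare                     # 3000 non-fraudulent records kept
(records - rare) - retain_nonrare  # 96000 records discarded
```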

45.8 Baseline Performance

For example, suppose we report that “only” 28.4% of customers adopting our International Plan will churn. That does not sound too bad, until we recall that, among all of our customers, the overall churn rate is only 14.49%. This overall churn rate may be considered our baseline, against which any further results can be calibrated. Thus, belonging to the International Plan actually nearly doubles the churn rate, which is clearly not good.

For example, suppose the algorithm your analytics company currently uses succeeds in identifying 90% of all fraudulent online transactions. Then, your company will probably expect your new data mining model to outperform this 90% baseline.

Validation


(C40)


(C41)


(C42)


(C43)


(C44)


(C45)


(C46)


(C47)


(C48)


46 Hierarchical and K-means Clustering

46.1 Overview

Definition 46.1 Clustering refers to the grouping of records, observations, or cases into classes of similar objects. Clustering differs from classification in that there is no target variable for clustering.
Definition 46.2 A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters.
  • Cluster vs. Classification
    • Clustering differs from classification in that there is no target variable for clustering.
    • The clustering task does not try to classify, estimate, or predict the value of a target variable.
    • Instead, clustering algorithms seek to segment the entire data set into relatively homogeneous subgroups or clusters, where the similarity of the records within the cluster is maximized, and the similarity to records outside this cluster is minimized.
      • Example: All zipcodes of a country can be described in terms of distinct lifestyle types (e.g. 66 types). One of them, “Upper Crust”, might be the wealthiest lifestyle of the country, i.e. couples between the ages of 45-64 without any dependents (children, parents) living at their home. This segment might have median earnings of ~1 million dollars and might possess a postgraduate degree. No other lifestyle type would have a more opulent standard of living.
      • Methods: Hierarchical and k-means clustering, Kohonen networks, BIRCH clustering
    • For optimal performance, clustering algorithms, just like algorithms for classification, require the data to be normalized so that no particular variable or subset of variables dominates the analysis.
    • Clustering algorithms seek to construct clusters of records such that the between-cluster variation is large compared to the within-cluster variation.
    • For continuous variables, we can use euclidean distance
    • For categorical variables, we may again define the “different from” function for comparing the \(i^{\text{th}}\) attribute values of a pair as 0 when \(x_i = y_i\) and 1 otherwise.
Definition 46.3 Euclidean distance between records is given by equation, \(d_{\text{Euclidean}}(x,y) = \sqrt{\sum_i{\left(x_i - y_i\right)^2}}\), where \(x = \{x_1, x_2, \ldots, x_m\}\) and \(y = \{y_1, y_2, \ldots, y_m\}\) represent the \({m}\) attribute values of two records.
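Definition 46.3 can be computed by hand or with base R's dist(). The two records below (with illustrative age, edu, hours values) are hypothetical:

```r
# #Euclidean distance between two hypothetical records
x <- c(age = 35, edu = 12, hours = 40)
y <- c(age = 45, edu = 16, hours = 38)
sqrt(sum((x - y)^2))   # manual computation of the formula
dist(rbind(x, y))      # base-R equivalent (method = "euclidean" by default)
```

As noted above, the variables should be normalized first so that no single attribute dominates the distance.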

46.2 Hierarchical Clustering

Definition 46.4 In hierarchical clustering, a treelike cluster structure (dendrogram) is created through recursive partitioning (divisive methods) or combining (agglomerative) of existing clusters.
Definition 46.5 Agglomerative clustering methods initialize each observation to be a tiny cluster of its own. Then, in succeeding steps, the two closest clusters are aggregated into a new combined cluster. In this way, the number of clusters in the data set is reduced by one at each step. Eventually, all records are combined into a single huge cluster. Most computer programs that apply hierarchical clustering use agglomerative methods.
Definition 46.6 Divisive clustering methods begin with all the records in one big cluster, with the most dissimilar records being split off recursively, into a separate cluster, until each record represents its own cluster.

46.3 Distance between Clusters

Definition 46.7 Single linkage, the nearest-neighbor approach, is based on the minimum distance between any record in cluster A and any record in cluster B. Cluster similarity is based on the similarity of the most similar members from each cluster. It tends to form long, slender clusters, which may sometimes lead to heterogeneous records being clustered together.
Definition 46.8 Complete linkage, the farthest-neighbor approach, is based on the maximum distance between any record in cluster A and any record in cluster B. Cluster similarity is based on the similarity of the most dissimilar members from each cluster. It tends to form more compact, spherelike clusters.
Definition 46.9 Average linkage is designed to reduce the dependence of the cluster-linkage criterion on extreme values, such as the most similar or dissimilar records. The criterion is the average distance of all the records in cluster A from all the records in cluster B. The resulting clusters tend to have approximately equal within-cluster variability. In general, average linkage leads to clusters more similar in shape to complete linkage than does single linkage.
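The three linkage criteria map directly onto the method argument of base R's hclust(). The built-in USArrests data below is purely illustrative:

```r
# #Agglomerative clustering under the three linkage criteria
# #(USArrests is an illustrative stand-in for real data)
d <- dist(scale(USArrests))                    # normalize, then Euclidean distances
hc_single   <- hclust(d, method = "single")    # nearest-neighbor approach
hc_complete <- hclust(d, method = "complete")  # farthest-neighbor approach
hc_average  <- hclust(d, method = "average")
# #Cut each dendrogram into, say, 4 clusters and compare cluster sizes
table(cutree(hc_single, k = 4))
table(cutree(hc_complete, k = 4))
table(cutree(hc_average, k = 4))
```

Single linkage typically produces one long, slender cluster plus a few singletons, while complete and average linkage yield more balanced cluster sizes, consistent with the definitions above.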

46.4 k-means Clustering

  • Refer k-means Algorithm

  • Refer Pseudo F-Statistic

  • One potential problem for applying the k-means algorithm is: Who decides how many clusters to search for i.e. who decides k

    • Unless the analyst has a priori knowledge of the number of underlying clusters, an “outer loop” should be added to the algorithm, which cycles through various promising values of k.
    • Clustering solutions for each value of k can therefore be compared, with the value of k resulting in the largest F statistic being selected.
    • Alternatively, some clustering algorithms, such as the BIRCH clustering algorithm, can select the optimal number of clusters.
  • What if some attributes are more relevant than others to the problem formulation

    • As cluster membership is determined by distance, we may apply the axis-stretching methods for quantifying attribute relevance.
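The “outer loop” over candidate values of k can be sketched as follows; USArrests and the range 2:6 are illustrative choices:

```r
# #An "outer loop" over candidate values of k, keeping the pseudo-F
# #statistic for each solution (USArrests is illustrative)
set.seed(7)
z <- scale(USArrests)   # normalize so no variable dominates
N <- nrow(z)
pseudo_f <- sapply(2:6, function(k) {
  km <- kmeans(z, centers = k, nstart = 25)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (N - k))
})
names(pseudo_f) <- 2:6
pseudo_f   # select the k with the largest pseudo-F
```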

Validation


(C50)


(C51)


47 Cluster Goodness

47.1 Overview

  • “Measuring Cluster Goodness”
Definition 47.1 Cluster separation represents how distant the clusters are from each other.
Definition 47.2 Cluster cohesion refers to how tightly related the records within the individual clusters are. SSE accounts only for cluster cohesion.

47.2 Silhouette Method

Definition 47.3 The silhouette is a characteristic of each data value. For each data value i, \(\text{Silhouette}_i = s_i = \frac{b_i - a_i}{\text{max}(b_i, a_i)} \to s_i \in [-1, 1]\), where \(a_i\) is the distance between the data value (Cohesion) and its cluster center, and \(b_i\) is the distance between the data value and the next closest cluster center (Separation).
  • The silhouette value is used to gauge how good the cluster assignment is for that particular point.
    • A positive value indicates that the assignment is good, with higher values being better than lower values.
    • A value that is close to zero is considered to be a weak assignment, as the observation could have been assigned to the next closest cluster with limited negative consequence.
    • A negative silhouette value is considered to be misclassified, as assignment to the next closest cluster would have been better.
    • It accounts for both separation and cohesion.
      • \(a_i\) represents cohesion, as it measures the distance between the data value and its cluster center
      • \(b_i\) represents separation, as it measures the distance between the data value and a different cluster.
    • Taking the average silhouette value over all records yields a useful measure of how well the cluster solution fits the data.
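The silhouette can be computed with the cluster package (attached earlier in this session). Note that cluster::silhouette() uses average within- and between-cluster distances rather than distances to cluster centers as in Definition 47.3, but the interpretation is the same; USArrests is an illustrative stand-in:

```r
library(cluster)  # attached earlier in this document's session

set.seed(7)
z  <- scale(USArrests)
km <- kmeans(z, centers = 2, nstart = 25)
sil <- silhouette(km$cluster, dist(z))
head(sil[, "sil_width"])   # per-record silhouette values
mean(sil[, "sil_width"])   # average silhouette for the whole solution
```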

47.3 Silhouette on Iris Data

“ForLater” - Nothing groundbreaking there for now.

47.4 Pseudo F-Statistic

Definition 47.4 The pseudo-F statistic measures the ratio of (i) the separation between the clusters, as measured by the mean square between the clusters (MSB), to (ii) the spread of the data within the clusters, as measured by the mean square error (MSE). i.e. \(F_{k-1, N-k} = \frac{\text{MSB}}{\text{MSE}} = \frac{\text{SSB}/(k-1)}{\text{SSE}/(N-k)}\)
  • Pseudo F-Statistic
    • Clustering algorithms seek to construct clusters of records such that the between-cluster variation is large compared to the within-cluster variation. Because this concept is analogous to the analysis of variance, we may define a pseudo-F statistic
    • MSB represents the between-cluster variation and MSE represents the within-cluster variation.
    • Thus, a “good” cluster would have a large value of the pseudo-F statistic, representing a situation where the between-cluster variation is large compared to the within-cluster variation.
    • Hence, as the k-means algorithm proceeds, and the quality of the clusters increases, we would expect MSB to increase, MSE to decrease, and F to increase.
    • Caution:
      • pseudo-F statistic should not be used to test for the presence of clusters in data.
      • However, if we have reason to believe that clusters do exist in the data, but we do not know how many clusters there are, then pseudo-F can be helpful.
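Definition 47.4 maps directly onto the components returned by kmeans(); the data set and k = 3 below are illustrative choices:

```r
# #Computing the pseudo-F statistic from a single k-means fit
# #(USArrests and k = 3 are illustrative choices)
set.seed(7)
z  <- scale(USArrests)
k  <- 3
N  <- nrow(z)
km <- kmeans(z, centers = k, nstart = 25)
MSB <- km$betweenss / (k - 1)      # between-cluster mean square
MSE <- km$tot.withinss / (N - k)   # within-cluster mean square
MSB / MSE                          # pseudo-F: larger is better
```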

47.5 Pseudo F-Statistic on Iris Data

“ForLater” - NOTE that the Pseudo F-Statistic prefers k=3, in contrast to the Silhouette, which preferred k=2.

Validation


48 Association Rules

48.1 Overview

Definition 48.1 Affinity analysis, (or Association Rules or Market Basket Analysis), is the study of attributes or characteristics that “go together.” It seeks to uncover rules for quantifying the relationship between two or more attributes. Association rules take the form "If antecedent, then consequent", along with a measure of the support and confidence associated with the rule.
  • Example Beer-Diaper:
    • If out of 1000 customers, 200 bought diapers and of the 200 who bought diapers, 50 bought beer.
    • Then Association Rule: “If buy diapers, then buy beer.”
    • Prior Proportion = Support = (transactions with both diapers and beer) / Total = 50/1000 = 5%
    • Confidence = (transactions with both diapers and beer) / (transactions with diapers) = 50/200 = 25%
  • Problem: Dimensionality
    • The number of possible association rules grows exponentially in the number of attributes.
    • Ex: Suppose that a store has 100 items and any combination of them can be purchased by a customer, i.e. a customer can buy or not buy each of the items. Then there are on the order of \(2^{100}\) possible itemsets from which association rules can be formed
    • The a priori algorithm for mining association rules, however, takes advantage of structure within the rules themselves to reduce the search problem to a more manageable size.
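The beer-diaper arithmetic from the example above, worked through directly:

```r
# #Support and confidence for the rule "If buy diapers, then buy beer"
total   <- 1000   # customers
diapers <- 200    # transactions containing the antecedent (diapers)
both    <- 50     # transactions containing both diapers and beer
support    <- both / total     # 0.05
confidence <- both / diapers   # 0.25
c(support = support, confidence = confidence)
```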

48.2 Data Representation for Market Basket Analysis

  • Assume we ignore the quantity purchased; currently, we are only trying to identify which items go together.
  • Transaction data: Each row represents a transaction, with two variables, ID & Items (“Apple, Banana, Orange”)
    • It can be converted to long format with the same two variables, ID & Items, where Items contains only one item per row and ID is no longer unique.
  • Tabular Data: (Wider) An ID column plus one column per item. ID is unique. The item columns are binary, with 1 representing Yes/purchased and 0 representing No/did not buy.
    • Note: For simplicity, the variables here are flags (categorical, binary). However, the a priori algorithm can take categorical data with more than 2 levels without any issue.
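The two representations can be converted with the arules package (attached earlier in this session); the three baskets below are hypothetical:

```r
library(arules)  # attached earlier in this document's session

# #Transaction-style data: one (hypothetical) basket per list element
baskets <- list(c("Apple", "Banana", "Orange"),
                c("Apple", "Banana"),
                c("Orange"))
trans <- as(baskets, "transactions")
# #Tabular (wide, binary) view: one column per item, 1 = purchased
as(trans, "matrix") * 1
```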

48.3 Set Theory

  • Refer Association Rules
  • Let \(I\) represent set of items.
  • Let \(D\) be the set of transactions, where each transaction \(T\) in D is a set of items contained in I.
  • Suppose that we have a particular set of items A (e.g., potato and tomato), and another set of items B (e.g., onion).
  • Then an association rule takes the form if A, then B (i.e., \(A \Rightarrow B\)), where the antecedent A and the consequent B are proper subsets of I, and A and B are mutually exclusive.
    • This definition would exclude, for example, trivial rules such as if potato and tomato, then potato.
Definition 48.2 The support (s) for a particular association rule \(A \Rightarrow B\) is the proportion of transactions in the set of transactions D that contain both antecedent A and consequent B. Support is Symmetric. \(\text{Support} = P(A \cap B) = \frac{\text{Number of transactions containing both A and B}}{\text{Total Number of Transactions}}\)
Definition 48.3 The confidence (c) of the association rule \(A \Rightarrow B\) is a measure of the accuracy of the rule, as determined by the percentage of transactions in the set of transactions D containing antecedent A that also contain consequent B. Confidence is Asymmetric \(\text{Confidence} = P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{\text{Number of transactions containing both A and B}}{\text{Total Number of Transactions containing A}}\)
  • Analysts may prefer rules that have either high support or high confidence, and usually both.
    • Strong rules are those that meet or surpass certain minimum support and confidence criteria.
    • For example, an analyst interested in finding which supermarket items are purchased together may set a minimum support level of 20% and a minimum confidence level of 70%.
    • However, a fraud detection analyst or a terrorism detection analyst would need to reduce the minimum support level to 1% or less, because comparatively few transactions are either fraudulent or terror-related.
    • To provide an overall measure of usefulness for an association rule, analysts sometimes multiply support by confidence. This allows the analyst to rank the rules according to a combination of prevalence and accuracy.
Definition 48.4 An itemset is a set of items contained in I, and a k-itemset is an itemset containing k items. For example, {Potato, Tomato} is a 2-itemset, and {Potato, Tomato, Onion} is a 3-itemset, each from the vegetable stand set I.
Definition 48.5 The itemset frequency is simply the number of transactions that contain the particular itemset.
Definition 48.6 A frequent itemset is an itemset that occurs at least a certain minimum number of times, having itemset frequency \(\geq \phi\). We denote the set of frequent k-itemsets as \(F_k\).
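Definitions 48.2 and 48.3 can be made concrete with a small hand-built transaction list (toy baskets, not from the text), computing support and confidence for the rule {Potato} ⇒ {Tomato}:

# #Support and confidence for {Potato} => {Tomato} on toy baskets
baskets <- list(c("Potato", "Tomato", "Onion"),
                c("Potato", "Tomato"),
                c("Potato", "Onion"),
                c("Tomato", "Onion"))
contains <- function(basket, items) all(items %in% basket)
n_AB <- sum(sapply(baskets, contains, c("Potato", "Tomato")))   # A and B together: 2
n_A  <- sum(sapply(baskets, contains, "Potato"))                # antecedent A: 3
support    <- n_AB / length(baskets)   # 2/4 = 0.5
confidence <- n_AB / n_A               # 2/3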

48.4 Mining Association Rules

  • It is a Two-step Process
    • Find all frequent itemsets; that is, find all itemsets with frequency \(\geq \phi\).
    • From the frequent itemsets, generate association rules satisfying the minimum support and confidence conditions.
Definition 48.7 a priori property: If an itemset Z is not frequent, then for any item A, \(Z \cup \{A\}\) will not be frequent. In fact, no superset of Z (no itemset containing Z) will be frequent.
  • The a priori algorithm takes advantage of the a priori property to significantly shrink its search space.

    • DataType:
      • It can handle Categorical data without any issue.
      • Numerical attributes need to be supplied after discretisation. However, this will result in loss of information.
        • Alternative method for mining association rules is generalised rule induction (GRI) which can handle either categorical or numerical variables as inputs, but still requires categorical variables as outputs.
  • “ForLater” - Apply a priori on Adult Dataset

  • “ForLater” - “Association Rules are easy to do badly” - Example: Adult Dataset

    • If ‘workclass is Private’ then ‘Sex is Male’ with Support 69.5% and Confidence 65.6%
      • One needs to take into account the raw (prior) proportion of males in the data set, which in this case is 66.8%. In other words, applying this association rule actually reduces the probability of randomly selecting a male from 0.6684 to 0.6563.
    • Why, then, if the rule is so useless, did the software report it?
      • The quick answer is that the default ranking mechanism is confidence.
    • (Aside) So perhaps we can drop rules whose confidence is lower than the prior proportion in the data set.
    • We can instead evaluate a priori association rules using the confidence difference as the evaluative measure.
      • Here, rules are favored that provide the greatest increase in confidence from the prior to the posterior.
      • Ex: If ‘Marital status = Divorced’ then ‘Sex = Female,’ Support 13%, Confidence 60%
        • The data set contains 33.16% females, so an association rule that can identify females with 60% confidence is useful.
        • The confidence difference for this association rule is 0.60029 − 0.33160 = 0.26869, the gap between the posterior and prior confidences.
    • Alternatively, analysts may prefer to use the confidence ratio to evaluate potential rules: \(1 - \frac{\min(\text{posterior}, \text{prior})}{\max(\text{posterior}, \text{prior})}\)
      • The confidence ratio of the above rule is \(1 - 0.3316/0.60029 = 0.4476\)
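As a sketch of the “ForLater” item above, the arules package bundles a transactions version of the Adult data set; the support and confidence thresholds below are illustrative only:

# #a priori on the Adult data set bundled with 'arules'
library(arules)
data("Adult")
rules <- apriori(Adult,
                 parameter = list(supp = 0.4, conf = 0.7),
                 control   = list(verbose = FALSE))
# #The default ranking (confidence) can surface useless rules;
# #sorting by lift favours rules that beat the prior proportion.
inspect(sort(rules, by = "lift")[1:3])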

48.5 Usefulness of Association Rules

Definition 48.8 Lift is a measure that can quantify the usefulness of an association rule. Lift is Symmetric. \(\text{Lift} = \frac{\text{Rule Confidence}}{\text{Prior proportion of Consequent}}\)

Not all association rules are equally useful; lift quantifies how useful a given rule is.

  • Example Beer-Diaper:
    • If out of 1000 customers, 200 bought diapers and of the 200 who bought diapers, 50 bought beer.
    • Then Association Rule: “If buy diapers, then buy beer.”
    • Prior Proportion = Support = Consequent / Total = 50/1000 = 5%
    • Confidence = Consequent / Antecedent = 50 / 200 = 25%
    • Lift = Confidence / Support = 25/5 = 5
    • “Customers who buy diapers are five times as likely to buy beer as customers from the entire data set.”
  • Question: We know 50 of the 200 diaper buyers bought beer, but we do not know how many of the 1000 customers bought beer overall.
    • See the next example of diaper-makeup. There the two proportions are given separately: 40 makeup buyers out of 1000, and 5 makeup buyers out of the 200 diaper buyers.
    • The diaper-beer figure would be valid if the missing information were that exactly 50 of the 1000 customers bought beer, all of them among the 200 diaper buyers. In that case it is obvious that people who buy diapers are five times as likely to buy beer.
    • In fact we could then say that people who do not buy diapers are ‘somehow’ not buying beer at all. Extrapolate from diapers to babies and from beer to alcoholism, and we can declare with fully judgemental eyes that “Babies are the cause of alcoholism.” Hence proved!
  • Diaper-Makeup Situation:
    • “40 of the 1000 customers bought expensive makeup, whereas, of the 200 customers who bought diapers, only 5 bought expensive makeup.”
    • Then Association Rule: “If buy diapers, then buy expensive makeup”
    • Prior Proportion = Support = Consequent / Total = 40/1000 = 4%
    • Confidence = Consequent / Antecedent = 5 / 200 = 2.5%
    • Lift = Confidence / Support = 2.5/4 = 0.625
    • So, customers who buy diapers are only 62.5% as likely to buy expensive makeup as customers in the entire data set.
  • Lift
    • Lift value of 1 implies that A and B are independent events, meaning that knowledge of the occurrence of A does not alter the probability of the occurrence of B. Such relationships are not useful from a data mining perspective, and thus we prefer our association rules to have a lift value different from 1.
  • Association Rules are Supervised or Unsupervised
    • Most data mining methods represent supervised learning.
    • Association rule mining, however, can be applied in either a supervised or an unsupervised manner.
    • Analysis of purchase patterns is unsupervised because we are simply interested in which items go together. However, analysis of voter profiles can be supervised because voting preference naturally acts as a target and fills the role of consequent, not antecedent.
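The lift figures from both worked examples can be reproduced with a one-line helper (note that, as discussed above, the 5% used as the beer prior is really the rule support, since the overall beer count is not given):

# #Lift = rule confidence / prior proportion of the consequent
lift <- function(confidence, prior) confidence / prior
lift(0.25, 0.05)    # beer-diaper
## [1] 5
lift(0.025, 0.04)   # diaper-makeup
## [1] 0.625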

Validation

# #SUMMARISED Packages and Objects (BOOK CHECK)
f_()
## [1] "ii_num, jj_chr"
#
difftime(Sys.time(), k_start)
## Time difference of 1.316533 mins


References

Daniel T. Larose, Chantal D. Larose. 2015. Data Mining and Predictive Analytics. Second Edition. Danvers, MA 01923 USA: Wiley. https://www.wiley.com.
David R. Anderson, Thomas A. Williams, Dennis J. Sweeney. 2018. Statistics for Business and Economics. Revised 13e. Boston, MA 02210 USA: Cengage Learning. https://www.cengage.com.

Glossary

THEOREMS

DEFINITIONS

1.1: Vectors

Vectors are the simplest type of data structure in R. A vector is a sequence of data elements of the same basic type.

1.2: Components

Members of a vector are called components.

1.3: Packages

Packages are the fundamental units of reproducible R code.

1.4: Rounding

Rounding means replacing a number with an approximate value that has a shorter, simpler, or more explicit representation.

1.5: Significant-Digits

Significant digits, (or significant figures, or precision or resolution), of a number in positional notation are digits in the number that are reliable and necessary to indicate the quantity of something.

2.1: R-Markdown

R Markdown is a file format for making dynamic documents with R.

2.2: NA

NA is a logical constant of length 1 which contains a missing value indicator.

2.3: Factors

Factors are the data objects which are used to categorize the data and store it as levels.

2.4: Lists

Lists are by far the most flexible data structure in R. They can be seen as a collection of elements without any restriction on the class, length or structure of each element.

2.5: DataFrame

Data Frames are lists with the restriction that all elements of a data frame are of equal length.

7.1: H-Variances

\(\text{\{Variances\}} {H_0} : {\sigma}_1 = {\sigma}_2 = \dots = {\sigma}_k \iff {H_a}: \text{At least two variances differ.}\)

8.1: Imputation

Imputation is the process of replacing missing data with substituted values. Imputation preserves all cases by replacing missing data with an estimated value based on other available information.

16.1: Redundant

A rule can be defined as redundant if a more general rule with the same or a higher confidence exists.

23.1: Data

Data are the facts and figures collected, analysed, and summarised for presentation and interpretation.

23.2: Elements

Elements are the entities on which data are collected. (Generally ROWS)

23.3: Variable

A variable is a characteristic of interest for the elements. (Generally COLUMNS)

23.4: Observation

The set of measurements obtained for a particular element is called an observation.

23.5: Statistics

Statistics is the art and science of collecting, analysing, presenting, and interpreting data.

23.6: Scale-of-Measurement

The scale of measurement determines the amount of information contained in the data and indicates the most appropriate data summarization and statistical analyses.

23.7: Nominal-Scale

When the data for a variable consist of labels or names used to identify an attribute of the element, the scale of measurement is considered a nominal scale.

23.8: Ordinal-Scale

The scale of measurement for a variable is considered an ordinal scale if the data exhibit the properties of nominal data and in addition, the order or rank of the data is meaningful.

23.9: Interval-Scale

The scale of measurement for a variable is an interval scale if the data have all the properties of ordinal data and the interval between values is expressed in terms of a fixed unit of measure.

23.10: Ratio-Scale

The scale of measurement for a variable is a ratio scale if the data have all the properties of interval data and the ratio of two values is meaningful.

23.11: Categorical-Data

Data that can be grouped by specific categories are referred to as categorical data. Categorical data use either the nominal or ordinal scale of measurement.

23.12: Quantitative-Data

Data that use numeric values to indicate ‘how much’ or ‘how many’ are referred to as quantitative data. Quantitative data are obtained using either the interval or ratio scale of measurement.

23.13: Discrete

Quantitative data that measure ‘how many’ are discrete.

23.14: Continuous

Quantitative data that measure ‘how much’ are continuous because no separation occurs between the possible data values.

23.15: Cross-Sectional-Data

Cross-sectional data are data collected at the same or approximately the same point in time.

23.16: Time-Series-Data

Time-series data are data collected over several time periods.

23.17: Observational-Study

In an observational study we simply observe what is happening in a particular situation, record data on one or more variables of interest, and conduct a statistical analysis of the resulting data.

23.18: Experiment

The key difference between an observational study and an experiment is that an experiment is conducted under controlled conditions.

23.19: Descriptive-Statistics

Most of the statistical information is summarized and presented in a form that is easy to understand. Such summaries of data, which may be tabular, graphical, or numerical, are referred to as descriptive statistics.

23.20: Population

A population is the set of all elements of interest in a particular study.

23.21: Sample

A sample is a subset of the population.

23.22: Parameter-vs-Statistic

The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.

23.23: Census

The process of conducting a survey to collect data for the entire population is called a census.

23.24: Sample-Survey

The process of conducting a survey to collect data for a sample is called a sample survey.

23.25: Statistical-Inference

Statistics uses data from a sample to make estimates and test hypotheses about the characteristics of a population through a process referred to as statistical inference.

23.26: Analytics

Analytics is the scientific process of transforming data into insight for making better decisions.

23.27: Descriptive-Analytics

Descriptive analytics encompasses the set of analytical techniques that describe what has happened in the past.

23.28: Predictive-Analytics-301

Predictive analytics consists of analytical techniques that use models constructed from past data to predict the future or to assess the impact of one variable on another.

23.29: Prescriptive-Analytics

Prescriptive analytics is the set of analytical techniques that yield a best course of action.

23.30: Big-Data

Larger and more complex data sets are now often referred to as big data.

23.31: Data-Mining-301

Data Mining deals with methods for developing useful decision-making information from large databases. It can be defined as the automated extraction of predictive information from (large) databases.

24.1: Frequency-Distribution

A frequency distribution is a tabular summary of data showing the number (frequency) of observations in each of several non-overlapping categories or classes.

24.2: Cross-Tab

A crosstabulation is a tabular summary of data for two variables. It is used to investigate the relationship between them. Generally, one of the variables is categorical.

25.1: Number

A number is a mathematical object used to count, measure, and label. Their study or usage is called arithmetic, a term which may also refer to number theory, the study of the properties of numbers.

25.2: Prime

A prime number is a natural number greater than 1 that is not a product of two smaller natural numbers. A natural number greater than 1 that is not prime is called a ‘composite number.’ 1 is neither a Prime nor a composite, it is a ‘Unit.’ Thus, by definition, Negative Integers and Zero cannot be Prime.

25.3: Parity-Odd-Even

Parity is the property of an integer \(\mathbb{Z}\) of whether it is even or odd. It is even if the integer is divisible by 2 with no remainders left and it is odd otherwise. Thus, -2, 0, +2 are even but -1, 1 are odd. Numbers ending with 0, 2, 4, 6, 8 are even. Numbers ending with 1, 3, 5, 7, 9 are odd.

25.4: Positive-Negative

An integer \(\mathbb{Z}\) is positive if it is greater than zero, and negative if it is less than zero. Zero is defined as neither negative nor positive.

25.5: Mersenne-Primes

Mersenne primes are those prime numbers that are of the form \((2^n -1)\); that is, \(\{3, 7, 31, 127, \ldots \}\)

25.6: Measures-of-Location

Measures of location are numerical summaries that indicate where on a number line a certain characteristic of the variable lies. Examples of the measures of location are percentiles and quantiles.

25.7: Measures-of-Center

The measures of center are a special case of measures of location. These estimate where the center of a particular variable lies. Most common are Mean, Median, and Mode.

25.8: Mean

Given a data set \({X = \{{x}_1, {x}_2, \ldots, {x}_n\}}\), the mean \({\overline{x}}\) is the sum of all of the values \({{x}_1, {x}_2, \ldots, {x}_n}\) divided by the count \({n}\).

25.9: Median

Median of a population is any value such that at most half of the population is less than the proposed median and at most half is greater than the proposed median.

25.10: Geometric-Mean

The geometric mean \(\overline{x}_g\) is a measure of location that is calculated by finding the \(n^{th}\) root of the product of \({n}\) values.

25.11: Mode

The mode is the value that occurs with greatest frequency.

25.12: Percentile

A percentile provides information about how the data are spread over the interval from the smallest value to the largest value. For a data set containing \({n}\) observations, the \(p^{th}\) percentile divides the data into two parts: approximately p% of the observations are less than the \(p^{th}\) percentile, and approximately (100 - p)% of the observations are greater than the \(p^{th}\) percentile.

25.13: Measures-of-Spread

Measures of spread (or the measures of variability) describe how spread out the data values are. Examples are Range, SD, mean absolute deviation, and IQR.

25.14: Variance

The variance \(({\sigma}^2)\) is based on the difference between the value of each observation \({x_i}\) and the mean \({\overline{x}}\). The average of the squared deviations is called the variance.

25.15: Standard-Deviation

The standard deviation (\(s, \sigma\)) is defined to be the positive square root of the variance. It is a measure of the amount of variation or dispersion of a set of values.
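As a quick illustration of the measures defined above (toy numbers, not from the text), base R provides each as a built-in function:

# #Measures of centre and spread for a small sample
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
mean(x)             # 5
median(x)           # 4.5
var(x)              # sample variance (divides by n - 1)
sd(x)               # its positive square root
quantile(x, 0.25)   # 25th percentile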

25.16: Skewness

Skewness \((\tilde{\mu}_{3})\) is a measure of the shape of a data distribution. It is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.

25.17: Tails

A tail refers to the tapering sides at either end of a distribution curve.

25.18: Kurtosis

Kurtosis \((\tilde{\mu}_{4})\) is a measure of the “tailedness” of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes the shape of a probability distribution. For \({\mathcal{N}}_{(\mu, \, \sigma)}\), kurtosis is 3 and excess kurtosis is 0 (i.e. subtract 3).

25.19: TheSample

A sample of \({n}\) observations given by \({X = \{{x}_1, {x}_2, \ldots, {x}_n\}}\) have a sample mean \({\overline{x}}\) and the sample standard deviation, \({s}\).

25.20: z-Scores

The z-score, \({z_i}\), can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). It is associated with each \({x_i}\). The z-score is often called the standardized value or standard score.

25.21: t-statistic

Computing a z-score requires knowing the mean \({\mu}\) and standard deviation \({\sigma}\) of the complete population to which a data point belongs. If one only has a sample of observations from the population, then the analogous computation with sample mean \({\overline{x}}\) and sample standard deviation \({s}\) yields the t-statistic.

25.22: Chebyshev-Theorem

Chebyshev Theorem can be used to make statements about the proportion of data values that must be within a specified number of standard deviations \({\sigma}\), of the mean \({\mu}\).

25.23: Empirical-Rule

Empirical rule is used to compute the percentage of data values that must be within one, two, and three standard deviations \({\sigma}\) of the mean \({\mu}\) for a normal distribution. These probabilities are Pr(x) 68.27%, 95.45%, and 99.73%.
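These probabilities follow directly from the standard normal CDF, e.g. in R:

# #Empirical rule: probability within k standard deviations of the mean
sapply(1:3, function(k) diff(pnorm(c(-k, k))))
## [1] 0.6826895 0.9544997 0.9973002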

25.24: Outliers

Outliers are data points or observations that do not fit the trend shown by the remaining data. They differ significantly from other observations. Unusually large or small values are commonly found to be outliers.

25.25: Covariance

Covariance is a measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship.

25.26: Correlation-Coefficient

Correlation coefficient is a measure of linear association between two variables that takes on values between -1 and +1. Values near +1 indicate a strong positive linear relationship; values near -1 indicate a strong negative linear relationship; and values near zero indicate the lack of a linear relationship.

26.1: Probability

Probability is a numerical measure of the likelihood that an event will occur. Probability values are always assigned on a scale from 0 to 1. A probability near zero indicates an event is unlikely to occur; a probability near 1 indicates an event is almost certain to occur.

26.2: Random-Experiment

A random experiment is a process that generates well-defined experimental outcomes. On any single repetition or trial, the outcome that occurs is determined completely by chance.

26.3: Sample-Space

The sample space for a random experiment is the set of all experimental outcomes.

26.4: Counting-Rule

Counting Rule for Multiple-Step Experiments: If an experiment can be described as a sequence of \({k}\) steps with \({n_1}\) possible outcomes on the first step, \({n_2}\) possible outcomes on the second step, and so on, then the total number of experimental outcomes is given by \(\{(n_1)(n_2) \cdots (n_k) \}\)

26.5: Tree-Diagram

A tree diagram is a graphical representation that helps in visualizing a multiple-step experiment.

26.6: Factorial

The factorial of a non-negative integer \({n}\), denoted by \(n!\), is the product of all positive integers less than or equal to n. The value of 0! is 1 i.e. \(0!=1\)

26.7: Combinations

Combination allows one to count the number of experimental outcomes when the experiment involves selecting \({k}\) objects from a set of \({N}\) objects. The number of combinations of \({N}\) objects taken \({k}\) at a time is equal to the binomial coefficient \(C_k^N\)

26.8: Permutations

Permutation allows one to compute the number of experimental outcomes when \({k}\) objects are to be selected from a set of \({N}\) objects where the order of selection is important. The same \({k}\) objects selected in a different order are considered a different experimental outcome. The number of permutations of \({N}\) objects taken \({k}\) at a time is given by \(P_k^N\)
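Base R provides these counting functions directly (the numbers below are an arbitrary illustration):

# #Factorial, combinations, permutations
factorial(5)                  # 5! = 120
choose(5, 2)                  # C(5,2) = 10 combinations, order ignored
choose(5, 2) * factorial(2)   # P(5,2) = 20 permutations, order matters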

26.9: Event

An event is a collection of sample points. The probability of any event is equal to the sum of the probabilities of the sample points in the event. The sample space, \({S}\), is an event. Because it contains all the experimental outcomes, it has a probability of 1; that is, \(P(S) = 1\)

26.10: Complement

Given an event \({A}\), the complement of A (\(A^c\)) is defined to be the event consisting of all sample points that are not in A. Thus, \(P(A) + P(A^{c}) =1\)

26.11: Union

Given two events A and B, the union of A and B is the event containing all sample points belonging to A or B or both. The union is denoted by \(A \cup B\)

26.12: Intersection

Given two events A and B, the intersection of A and B is the event containing the sample points belonging to both A and B. The intersection is denoted by \(A \cap B\)

26.13: Mutually-Exclusive

Two events are said to be mutually exclusive if the events have no sample points in common. Thus, \(A \cap B = \emptyset\), and hence \(P(A \cap B) = 0\)

26.14: Conditional-Probability

Conditional probability is the probability of an event given that another event has already occurred. The conditional probability of ‘A given B’ is \(P(A|B) = \frac{P(A \cap B)}{P(B)}\)

26.15: Events-Independent

Two events A and B are independent if \(P(A|B) = P(A) \quad \text{OR} \quad P(B|A) = P(B) \Rightarrow P(A \cap B) = P(A) \cdot P(B)\)

27.1: Random-Variable

A random variable is a numerical description of the outcome of an experiment. Random variables must assume numerical values. It can be either ‘discrete’ or ‘continuous.’

27.2: Discrete-Random-Variable

A random variable that may assume either a finite number of values or an infinite sequence of values such as \(0, 1, 2, \dots\) is referred to as a discrete random variable. This includes factor-type encodings, e.g. Male as 0, Female as 1, etc.

27.3: Continuous-Random-Variable

A random variable that may assume any numerical value in an interval or collection of intervals is called a continuous random variable. It is given by \(x \in [n, m]\). If the entire line segment between the two points also represents possible values for the random variable, then the random variable is continuous.

27.4: Probability-Distribution

The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.

27.5: Probability-Function

For a discrete random variable x, a probability function \(f(x)\), provides the probability for each value of the random variable.

27.6: Expected-Value-Discrete

The expected value, or mean, of a random variable is a measure of the central location for the random variable. i.e. \(E(x) = \mu = \sum xf(x)\)

27.7: Variance-Discrete

The variance is a weighted average of the squared deviations of a random variable from its mean. The weights are the probabilities. i.e. \(\text{Var}(x) = \sigma^2 = \sum \{(x- \mu)^2 \cdot f(x)\}\)

27.8: Bivariate

A probability distribution involving two random variables is called a bivariate probability distribution. A discrete bivariate probability distribution provides a probability for each pair of values that may occur for the two random variables.

28.1: Uniform-Probability-Distribution

Uniform probability distribution is a continuous probability distribution for which the probability that the random variable will assume a value in any interval is the same for each interval of equal length. Whenever the probability is proportional to the length of the interval, the random variable is uniformly distributed.

28.2: Probability-Density-Function

The probability that the continuous random variable \({x}\) takes a value between \([a, b]\) is given by the area under the graph of probability density function \(f(x)\); that is, \(A = \int _{a}^{b}f(x)\ dx\). Note that \(f(x)\) can be greater than 1, however its integral must be equal to 1.

28.3: Normal-Distribution

A normal distribution (\({\mathcal{N}}_{({\mu}, \, {\sigma}^2)}\)) is a type of continuous probability distribution for a real-valued random variable.

28.4: Standard-Normal

A random variable that has a normal distribution with a mean of zero \(({\mu} = 0)\) and a standard deviation of one \(({\sigma} = 1)\) is said to have a standard normal probability distribution. The z-distribution is given by \({\mathcal{z}}_{({\mu} = 0, \, {\sigma} = 1)}\)

29.1: Sampled-Population

The sampled population is the population from which the sample is drawn.

29.2: Frame

Frame is a list of the elements that the sample will be selected from.

29.3: Target-Population

The target population is the population we want to make inferences about. Generally (and preferably), it will be the same as the ‘Sampled-Population,’ but it may also differ.

29.4: SRS

A simple random sample (SRS) is a set of \({k}\) objects in a population of \({N}\) objects where all possible samples are equally likely to happen. The number of such different simple random samples is \(C_k^N\)

29.5: Sampling-without-Replacement

Sampling without replacement: Once an element has been included in the sample, it is removed from the population and cannot be selected a second time.

29.6: Sampling-with-Replacement

Sampling with replacement: Once an element has been included in the sample, it is returned to the population. A previously selected element can be selected again and therefore may appear in the sample more than once.

29.7: Random-Sample

A random sample of size \({n}\) from an infinite population is a sample selected such that the following two conditions are satisfied. Each element selected comes from the same population. Each element is selected independently. The second condition prevents selection bias.

29.8: Proportion

A population proportion \({P}\), is a parameter that describes a percentage value associated with a population. It is given by \(P = \frac{X}{N}\), where \({X}\) is the count of successes in the population, and \({N}\) is the size of the population. It is estimated through the sample proportion \(\overline{p} = \frac{x}{n}\), where \({x}\) is the count of successes in the sample, and \({n}\) is the size of the sample obtained from the population.

29.9: Point-Estimation

To estimate the value of a population parameter, we compute a corresponding characteristic of the sample, referred to as a sample statistic. This process is called point estimation.

29.10: Point-Estimator

A sample statistic is the point estimator of the corresponding population parameter. For example, the sample statistics \(\overline{x}, s, s^2, s_{xy}, r_{xy}\) are point estimators for the corresponding population parameters \({\mu}\) (mean), \({\sigma}\) (standard deviation), \(\sigma^2\) (variance), \(\sigma_{xy}\) (covariance), \(\rho_{xy}\) (correlation)

29.11: Point-Estimate

The numerical value obtained for the sample statistic is called the point estimate. ‘Estimate’ is used for the sample value only; the corresponding population value is the parameter. An estimate is a value, while an estimator is a function.

29.12: Sampling-Distribution

The sampling distribution of \({\overline{x}}\) is the probability distribution of all possible values of the sample mean \({\overline{x}}\).

29.13: Standard-Error

In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\). It is the indicator of ‘Sampling Fluctuation.’

29.14: Sampling-Error

A sampling error is the difference between a population parameter and a sample statistic.

29.15: Central-Limit-Theorem

Central Limit Theorem: In selecting random samples of size \({n}\) from a population, the sampling distribution of the sample mean \({\overline{x}}\) can be approximated by a normal distribution as the sample size becomes large.
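A small simulation sketch (arbitrary choice of a skewed exponential population) illustrates the theorem:

# #Sampling distribution of the mean from a skewed population
set.seed(1)
n <- 50
xbar <- replicate(10000, mean(rexp(n, rate = 1)))
mean(xbar)   # close to the population mean, 1
sd(xbar)     # close to sigma / sqrt(n) = 1 / sqrt(50)
hist(xbar)   # roughly bell-shaped despite the skewed population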

30.1: Interval-Estimate

Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error (MOE), to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{MOE}_{\gamma}\)

30.2: Confidence-Interval

Confidence interval is another name for an interval estimate. Normally it is given as \(({\gamma} = 1 - {\alpha})\). Ex: 95% confidence interval

30.3: Confidence-Coefficient

The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.

30.4: t-distribution

When \({s}\) is used to estimate \({\sigma}\), the margin of error and the interval estimate for the population mean are based on a probability distribution known as the t distribution.
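A sketch of the resulting interval estimate, on made-up data, using the t quantile for the margin of error:

# #95% t-based interval estimate for a population mean
x   <- c(12, 15, 14, 10, 13, 16, 11, 14)
n   <- length(x)
moe <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
mean(x) + c(-1, 1) * moe   # point estimate +/- margin of error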

30.5: Degrees-of-Freedom

The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. In general, the degrees of freedom of an estimate of a parameter are \((n - 1)\).

31.1: Hypothesis-Testing

Hypothesis testing is a process in which, using data from a sample, an inference is made about a population parameter or a population probability distribution.

31.2: Hypothesis-Null

Null Hypothesis \((H_0)\) is a tentative assumption about a population parameter. It is assumed True, by default, in the hypothesis testing procedure.

31.3: Hypothesis-Alternative

Alternative Hypothesis \((H_a)\) is the complement of the Null Hypothesis. It is concluded to be True, if the Null Hypothesis is rejected.

31.4: Hypothesis-1T-Lower-Tail

\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)

31.5: Hypothesis-1T-Upper-Tail

\(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)

31.6: Hypothesis-2T-Two-Tail

\(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)

31.7: Error-Type-I

The error of rejecting \({H_0}\) when it is true is a Type I error \(({\alpha})\).

31.8: Error-Type-II

The error of accepting \({H_0}\) when it is false is a Type II error \(({\beta})\).

31.9: Level-of-Significance

The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.

31.10: Significance-Tests

Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.

31.11: Test-Statistic

Test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely the observed data match the distribution expected under the null hypothesis of that statistical test. It helps determine whether a null hypothesis should be rejected.

31.12: Tailed-Test

A one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic.

31.13: One-Tailed-Test

One-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its sampling distribution.

31.14: 1s-known-sd

If \({\sigma}\) is known, the standard normal random variable \({z}\) is used as test statistic to determine whether \({\overline{x}}\) deviates from the hypothesized value of \({\mu}\) enough to justify rejecting the null hypothesis. Refer equation (31.1) \(\to z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}}\)
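
A sketch of equation (31.1) in R with hypothetical numbers (\(\overline{x} = 52\), \({\mu}_0 = 50\), known \({\sigma} = 8\), \(n = 64\)):

```r
# sigma-known z test statistic and its upper-tail p-value.
xbar <- 52; mu0 <- 50; sigma <- 8; n <- 64
z <- (xbar - mu0) / (sigma / sqrt(n))
z             # 2
1 - pnorm(z)  # upper-tail p-value, about 0.023
```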

31.15: Approach-p-value

The p-value approach uses the value of the test statistic \({z}\) to compute a probability called a p-value.

31.16: p-value

A p-value is a probability that provides a measure of the evidence against the null hypothesis provided by the sample. The p-value is used to determine whether the null hypothesis should be rejected. Smaller p-values indicate more evidence against \({H_0}\).

31.17: Approach-Critical-Value

The critical value approach requires that we first determine a value for the test statistic called the critical value.

31.18: Critical-Value

Critical value is the value that is compared with the test statistic to determine whether \({H_0}\) should be rejected. Significance level \({\alpha}\), or confidence level (\(1 - {\alpha}\)), dictates the critical value (\(Z\)), or critical limit. Ex: For Upper Tail Test, \(Z_{{\alpha} = 0.05} = 1.645\).
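
Critical values from the standard normal can be looked up with `qnorm()`, as a quick sketch:

```r
# Standard normal critical values for alpha = 0.05.
qnorm(1 - 0.05)      # upper-tail test  -> 1.645
qnorm(1 - 0.05 / 2)  # two-tailed test  -> 1.96
```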

31.19: Acceptance-Region

An acceptance region (confidence interval) is a set of values of the test statistic for which the null hypothesis is accepted, i.e. if the observed test statistic falls in the confidence interval, then we accept the null hypothesis and reject the alternative hypothesis.

31.20: Margin-Error

The margin of error tells how far the population mean might be from the sample mean. It is given by \(Z\frac{{\sigma}}{\sqrt{n}}\)

31.21: Rejection-Region

A rejection region (critical region) is a set of values of the test statistic for which the null hypothesis is rejected, i.e. if the observed test statistic falls in the critical region, then we reject the null hypothesis and accept the alternative hypothesis.

31.22: Two-Tailed-Test

Two-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in either tail of its sampling distribution.

31.23: Approach-p-value-Steps

p-value Approach: Form Hypothesis | Specify \({\alpha}\) | Calculate test statistic | Calculate p-value | Compare p-value with \({\alpha}\) | Interpret

31.24: 1s-unknown-sd

If \({\sigma}\) is unknown, the sampling distribution of the test statistic follows the t distribution with \((n - 1)\) degrees of freedom. Refer equation (31.3) \(\to t = \frac{{\overline{x}} - {\mu}_0}{{s}/\sqrt{n}}\)
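
A minimal sketch in R with made-up data: a one-sample t test of \({H_0}: {\mu} = 15\).

```r
# sigma-unknown t test: t.test() and the statistic by hand (eq. 31.3).
x <- c(14.2, 15.1, 14.8, 15.6, 14.9, 15.3, 14.7, 15.0)
t.test(x, mu = 15, alternative = "two.sided")
(mean(x) - 15) / (sd(x) / sqrt(length(x)))  # same t statistic
```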

31.25: H-1s-p-Lower

\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {p} \geq {p}_0 \iff {H_a}: {p} < {p}_0\)

31.26: H-1s-p-Upper

\(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)

31.27: H-1s-p-Two

\(\text{\{Two Tail Test \} } \thinspace {H_0} : {p} = {p}_0 \iff {H_a}: {p} \neq {p}_0\)

31.28: Power

The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).

31.29: Power-Curve

Power Curve is a graph of the probability of rejecting \({H_0}\) for all possible values of the population parameter \({\mu}\) not satisfying the null hypothesis. It provides the probability of correctly rejecting the null hypothesis.

32.1: Independent-Simple-Random-Samples

Let \({\mathcal{N}}_{({\mu}_1, \, {\sigma}_1)}\) and \({\mathcal{N}}_{({\mu}_2, \, {\sigma}_2)}\) be the two populations. To make an inference about the difference between the means \(({\mu}_1 - {\mu}_2)\), we select a simple random sample of \({n}_1\) units from population 1 and a second simple random sample of \({n}_2\) units from population 2. The two samples, taken separately and independently, are referred to as independent simple random samples.

32.2: H-2s-Lower

\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)

32.3: H-2s-Upper

\(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)

32.4: H-2s-Two

\(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)

32.5: Shapiro-Wilk-Test

The Shapiro-Wilk test is a test of normality. It tests the null hypothesis that a sample came from a normally distributed population. \(P_{\text{shapiro}} > ({\alpha} = 0.05) \to \text{Data is Normal}\). Avoid using samples with more than 5000 observations (R’s shapiro.test() accepts only 3 to 5000 values).
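
A quick sketch on simulated data using base R’s `shapiro.test()`:

```r
# Shapiro-Wilk: normal data vs. clearly non-normal (exponential) data.
set.seed(1)
shapiro.test(rnorm(100))$p.value  # typically large: no evidence against normality
shapiro.test(rexp(100))$p.value   # small: reject normality
```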

32.6: Independent-Sample-Design-Example

Independent sample design: A simple random sample of workers is selected and each worker in the sample uses method 1. A second independent simple random sample of workers is selected and each worker in this sample uses method 2.

32.7: Matched-Sample-Design-Example

Matched sample design: One simple random sample of workers is selected. Each worker first uses one method and then uses the other method. The order of the two methods is assigned randomly to the workers, with some workers performing method 1 first and others performing method 2 first. Each worker provides a pair of data values, one value for method 1 and another value for method 2.

32.8: Hypo-Paired-Two

\(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_d = 0 \iff {H_a}: {\mu}_d \neq 0\)
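
A sketch in R with made-up completion times: each worker tries both methods, so the analysis is performed on the paired differences \(d = \text{method1} - \text{method2}\).

```r
# Matched-sample (paired) t test of H0: mu_d = 0.
method1 <- c(6.0, 5.0, 7.0, 6.2, 6.0, 6.4)
method2 <- c(5.4, 5.2, 6.5, 5.9, 6.0, 5.8)
t.test(method1, method2, paired = TRUE)
```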

32.9: H-2s-p-Lower

\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {p}_1 - {p}_2 \geq 0 \iff {H_a}: {p}_1 - {p}_2 < 0\)

32.10: H-2s-p-Upper

\(\text{\{Right or Upper\} } {H_0} : {p}_1 - {p}_2 \leq 0 \iff {H_a}: {p}_1 - {p}_2 > 0\)

32.11: H-2s-p-Two

\(\text{\{Two Tail Test \} } \thinspace {H_0} : {p}_1 - {p}_2 = 0 \iff {H_a}: {p}_1 - {p}_2 \neq 0\)

33.1: Distribution-Chi-Square

Whenever a simple random sample of size \({n}\) is selected from a normal population, the sampling distribution of \(\frac{(n-1)s^2}{{\sigma}^2}\) is a chi-square distribution with \({n - 1}\) degrees of freedom.

33.2: H-1s-Var-Lower

\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\sigma}^2 \geq {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 < {{\sigma}_0^2}\)

33.3: H-1s-Var-Upper

\(\text{\{Right or Upper\} } {H_0} : {\sigma}^2 \leq {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 > {{\sigma}_0^2}\)

33.4: H-1s-Var-Two

\(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}^2 = {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 \neq {{\sigma}_0^2}\)

33.5: Distribution-F

Whenever independent simple random samples of sizes \({n}_1\) and \({n}_2\) are selected from two normal populations with equal variances \(({\sigma}_1^2 = {\sigma}_2^2)\), the sampling distribution of \(\frac{{s}_1^2}{{s}_2^2}\) is an F distribution with \(({n}_1 - 1)\) degrees of freedom for the numerator and \(({n}_2 - 1)\) degrees of freedom for the denominator.

33.6: H-2s-Var-Lower

\(\text{\{Left or Lower \} }\space\thinspace \text{Not used.}\) By convention, the population yielding the larger sample variance is designated population 1, so the two-sample test of variances is formulated as an upper-tail or two-tailed test only.

33.7: H-2s-Var-Upper

\(\text{\{Right or Upper\} } {H_0} : {\sigma}_1^2 \leq {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 > {\sigma}_2^2\)

33.8: H-2s-Var-Two

\(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}_1^2 = {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 \neq {\sigma}_2^2\)
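
A sketch on simulated samples: base R’s `var.test()` performs the F test of \({H_0}: {\sigma}_1^2 = {\sigma}_2^2\) with \(({n}_1 - 1, {n}_2 - 1)\) degrees of freedom.

```r
# F test comparing two population variances.
set.seed(7)
x1 <- rnorm(30, sd = 2)
x2 <- rnorm(25, sd = 2)
var.test(x1, x2)  # F = s1^2 / s2^2
```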

34.1: H-3p

\(\text{\{Equality of Population Proportions \}} {H_0} : {p}_1 = {p}_2 = \dots = {p}_k \iff {H_a}: \text{Not all population proportions are equal}\)

35.1: Randomization

Randomization is the process of assigning the treatments to the experimental units at random.

35.2: H-ANOVA

\(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)
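
A minimal sketch on simulated data: a one-way ANOVA comparing three treatment means with `aov()`.

```r
# One-way ANOVA: F test of H0: mu_A = mu_B = mu_C.
set.seed(42)
y <- c(rnorm(10, mean = 20), rnorm(10, mean = 22), rnorm(10, mean = 25))
treatment <- factor(rep(c("A", "B", "C"), each = 10))
summary(aov(y ~ treatment))
```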

36.1: Regression-Analysis

Regression analysis can be used to develop an equation showing how two or more variables are related.

36.2: Variable-Dependent

The variable being predicted is called the dependent variable \(({y})\).

36.3: Variable-Independent

The variable or variables being used to predict the value of the dependent variable are called the independent variables \(({x})\).

36.4: Simple-Linear-Regression

The simplest type of regression analysis involving one independent variable and one dependent variable in which the relationship between the variables is approximated by a straight line, is called simple linear regression.

36.5: Regression-Model

The equation that describes how \({y}\) is related to \({x}\) and an error term \(\epsilon\) is called the regression model. For example, simple linear regression model is given by equation \({y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon}\)

36.6: Error-Term

The random variable, error term \(({\epsilon})\), accounts for the variability in \({y}\) that cannot be explained by the linear relationship between \({x}\) and \({y}\).

36.7: Regression-Equation

The equation that describes how the mean or expected value of \({y}\), denoted \(E(y)\), is related to \({x}\) is called the regression equation. Simple Linear Regression Equation is: \(E(y) = {\beta}_0 + {\beta}_1 {x}\). The graph of the simple linear regression equation is a straight line; \({\beta}_0\) is the y-intercept of the regression line, \({\beta}_1\) is the slope.

36.8: Estimated-Regression-Equation

Sample statistics (denoted \(b_0\) and \(b_1\)) are computed as estimates of the population parameters \({\beta}_0\) and \({\beta}_1\). Thus Estimated Simple Linear Regression Equation is: \(\hat{y} = b_0 + b_1 {x}\). The value of \(\hat{y}\) provides both a point estimate of \(E(y)\) for a given value of ‘x’ and a prediction of an individual value of ‘y’ for a given value of ‘x.’
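
A sketch using the built-in `cars` data: `lm()` computes the least squares estimates \(b_0\) and \(b_1\), and `predict()` gives \(\hat{y}\) for a chosen value of x.

```r
# Estimated simple linear regression of stopping distance on speed.
fit <- lm(dist ~ speed, data = cars)
coef(fit)                                       # b0 (intercept), b1 (slope)
predict(fit, newdata = data.frame(speed = 15))  # y-hat at speed = 15
```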

36.9: Least-Squares

The least squares method is a procedure for using sample data to find the estimated regression equation. It uses the sample data to provide the values of \(b_0\) and \(b_1\) that minimize the sum of the squares of the deviations between the observed values of the dependent variable \(y_i\) and the predicted values of the dependent variable \(\hat{y}_i\), i.e. \(\min \Sigma(y_i - \hat{y}_i)^2\), or \(\min(\text{SSE})\).

36.10: Residuals

The deviations of the y values about the estimated regression line are called residuals. The \(i^{\text{th}}\) residual represents the error in using (predicted) \(\hat{y}_i\) to estimate (observed) \(y_i\).

36.11: SSE

The sum of squares of residuals or errors is the quantity that is minimized by the least squares method. This quantity, also known as the sum of squares due to error, is denoted by SSE. i.e. \(\text{SSE} = \Sigma(y_i - \hat{y}_i)^2\)

36.12: SSR

To measure how much the \(\hat{y}\) values on the estimated regression line deviate from \(\overline{y}\), another sum of squares is computed. This sum of squares, called the sum of squares due to regression, is denoted SSR. i.e. \(\text{SSR} = \Sigma(\hat{y}_i - \overline{y})^2\)

36.13: SST

For the \(i^{\text{th}}\) observation in the sample, the difference \(y_i - \overline{y}\) provides a measure of the error involved in using \(\overline{y}\) for prediction. The corresponding sum of squares, called the total sum of squares, is denoted SST. i.e. \(\text{SST} = \Sigma(y_i - \overline{y})^2 \to \text{SST} = \text{SSE} + \text{SSR}\). SST is a measure of the total variability in the values of the response variable alone, without reference to the predictor.

36.14: Coefficient-of-Determination

The ratio \(r^2 =\frac{\text{SSR}}{\text{SST}} \in [0, 1]\), is used to evaluate the goodness of fit for the estimated regression equation. This ratio is called the coefficient of determination (\(r^2\)). It can be interpreted as the proportion of the variability in the dependent variable y that is explained by the estimated regression equation.
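
A sketch computing SSE, SSR, SST, and \(r^2\) by hand from an `lm()` fit on the built-in `cars` data:

```r
# Goodness of fit: r^2 = SSR / SST.
fit <- lm(dist ~ speed, data = cars)
SSE <- sum(residuals(fit)^2)
SST <- sum((cars$dist - mean(cars$dist))^2)
SSR <- SST - SSE
SSR / SST               # coefficient of determination
summary(fit)$r.squared  # same value
```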

36.15: Simple-Linear-Regression-Assumption-1

Regression Assumption 1/4 (Zero-Mean): For Regression Model \({y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon}\) : The error term \(\epsilon\) is a random variable with a mean or expected value of zero; \(E(\epsilon) = 0\). (Implication) \(\beta_0\) and \(\beta_1\) are constants, therefore \(E(\beta_0) = \beta_0\) and \(E(\beta_1) = \beta_1\); thus, for a given value of x, the expected value of y is given by Regression equation \(E(y) = {\beta}_0 + {\beta}_1 {x}\)

36.16: Simple-Linear-Regression-Assumption-2

Regression Assumption 2/4 (Constant Variance): For Regression Model \({y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon}\) and Regression equation \(E(y) = {\beta}_0 + {\beta}_1 {x}\) : The variance of \(\epsilon\), denoted by \({\sigma}^2\), is the same for all values of x. (Implication) The variance of y about the regression line equals \({\sigma}^2\) and is the same for all values of x.

36.17: Simple-Linear-Regression-Assumption-3

Regression Assumption 3/4 (Independence): For Regression Model \({y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon}\) and Regression equation \(E(y) = {\beta}_0 + {\beta}_1 {x}\) : The values of \(\epsilon\) are independent. (Implication) The value of \(\epsilon\) for a particular value of x is not related to the value of \(\epsilon\) for any other value of x; thus, the value of y for a particular value of x is not related to the value of y for any other value of x.

36.18: Simple-Linear-Regression-Assumption-4

Regression Assumption 4/4 (Normality): For Regression Model \({y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon}\) and Regression equation \(E(y) = {\beta}_0 + {\beta}_1 {x}\) : The error term \(\epsilon\) is a normally distributed random variable for all values of x. (Implication) Because y is a linear function of \(\epsilon\), y is also a normally distributed random variable for all values of x.

36.19: Simple-Linear-Regression-Assumption-Summary

Four Regression Assumptions: (1) Zero-Mean: \(E(\epsilon) = 0\). (2) Constant Variance: The variance of \(\epsilon\) (\({\sigma}^2\)) is the same for all values of x. (3) Independence: The values of \(\epsilon\) are independent. (4) Normality: The error term \(\epsilon\) has a normal distribution.

36.20: MSE

Mean squared error (MSE) is a measure of the accuracy of a model’s estimates for a continuous target variable. It provides the estimate of \({\sigma}^2\) and is given by SSE divided by its degrees of freedom \((n - 2)\), i.e. \(s^2 = \text{MSE} = \frac{\text{SSE}}{n - 2}\), where ‘s’ is the standard error of the estimate. Lower MSE is preferred.

36.21: Standard-Error-B1

Standard deviation of \(b_1\) is \({\sigma}_{b_1}\). Its estimate, estimated standard deviation of \(b_1\), is given by \(s_{b_1} = \frac{s}{\sqrt{\Sigma (x_i - {\overline{x}})^2}}\). The standard deviation of \(b_1\) is also referred to as the standard error of \(b_1\). Thus, \(s_{b_1}\) provides an estimate of the standard error of \(b_1\).

36.22: H-SimpleRegression

\(\text{\{Test for Significance in Simple Linear Regression\} } {H_0} : {\beta}_1 = 0 \iff {H_a}: {\beta}_1 \neq 0\)

36.23: MSR

The mean square due to regression (MSR) is given by SSR divided by its degrees of freedom, which equals the number of independent variables: \(\text{MSR} = \frac{\text{SSR}}{\text{Regression degrees of freedom}} = \frac{\text{SSR}}{\text{Number of independent variables}}\). If the null hypothesis \({\beta}_1 = 0\) is true, MSR provides an estimate of \({\sigma}^2\); the F test for significance compares MSR with MSE.

36.24: QQ-Plot

A normal probability plot is a quantile-quantile plot of the quantiles of a particular distribution against the quantiles of the standard normal distribution, for the purposes of determining whether the specified distribution deviates from normality.

36.25: High-Leverage-Points

High leverage points are observations with extreme values for the independent variables. The leverage of an observation is determined by how far the values of the independent variables are from their mean values.

36.26: Influential-Observations

Influential observations are those observations which have a strong influence or effect on the regression results. Influential observations can be identified from a scatter diagram when only one independent variable is present.

36.27: Cook-Distance

Cook’s distance (\(D_i\)) is the most common measure of the influence of an observation. It takes into account both the size of the residual and the amount of leverage for that observation. Generally, an observation is influential if \(D_i > 1\)

37.1: Multiple-Regression

Multiple regression analysis is the study of how a dependent variable \(y\) is related to two or more independent variables. Multiple Regression Model is \({y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon\)

37.2: Multiple-Regression-Equation

The equation that describes how the mean or expected value of \({y}\), denoted \(E(y)\), is related to \({x}\) is called the regression equation. Multiple Regression Linear Equation is: \(E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p\).

37.3: Estimated-Multiple-Regression-Equation

Estimated Multiple Regression Equation is given by \(\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p\). Where \(b_i\) represents an estimate of the change in y corresponding to a one-unit change in \(x_i\) when all other independent variables are held constant.

37.4: RSq-Adj

If a variable is added to the model, \(R^2\) becomes larger even if the variable added is not statistically significant. The adjusted multiple coefficient of determination \((R_a^2)\) compensates for the number of independent variables in the model. With ‘n’ denoting the number of observations and ‘p’ denoting the number of independent variables: \(R_a^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}\)
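
A sketch computing adjusted \(R^2\) by hand and checking it against `summary(lm)$adj.r.squared`, using the built-in `mtcars` data:

```r
# Adjusted R^2 with n observations and p predictors.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars); p <- 2
r2 <- summary(fit)$r.squared
1 - (1 - r2) * (n - 1) / (n - p - 1)  # adjusted R^2 by hand
summary(fit)$adj.r.squared            # same value
```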

37.5: H-MultipleRegression-F

\(\text{\{F-Test in Multiple Linear Regression\} } {H_0} : {\beta}_1 = {\beta}_2 = \cdots = {\beta}_p = 0 \iff {H_a}: \text{At least one parameter is not zero}\)

37.6: H-MultipleRegression-t

\(\text{\{t-Test in Multiple Linear Regression\} } {H_0} : {\beta}_i = 0 \iff {H_a}: {\beta}_i \neq 0\)

37.7: Multicollinearity-c15

Multicollinearity refers to the correlation among the independent variables.

37.8: Dummy-Variables

A categorical variable with \(k\) levels must be modeled using \(k - 1\) dummy variables (or indicator variables), each taking only the values 0 and 1. e.g. A variable with 3 levels {low, medium, high} needs 2 dummy variables \(\{x_1, x_2\}\): low \(\to \{x_1 = 1, x_2 = 0\}\), medium \(\to \{x_1 = 0, x_2 = 1\}\), high \(\to \{x_1 = 0, x_2 = 0\}\). Thus \(x_1\) is 1 when low and 0 otherwise, and \(x_2\) is 1 when medium and 0 otherwise; high is represented by both being zero. At most one of them can be 1 at a time.
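
A quick sketch: R’s `model.matrix()` creates the \(k - 1\) dummy variables automatically for a factor, using the first level as the baseline.

```r
# A 3-level factor yields 2 dummy columns (plus the intercept).
size <- factor(c("low", "medium", "high"), levels = c("high", "low", "medium"))
model.matrix(~ size)  # columns sizelow and sizemedium; high is the baseline
```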

37.9: Odds-Ratio

The odds ratio measures the impact on the odds of a one-unit increase in only one of the independent variables. The odds ratio is the odds that y = 1 given that one of the independent variables has been increased by one unit \((\text{odds}_1)\) divided by the odds that y = 1 given no change in the values for the independent variables \((\text{odds}_0)\). i.e. \(\text{Odds Ratio} = \frac{\text{odds}_1}{\text{odds}_0}\)

37.10: VIF

The variance inflation factor (VIF) is given by \(\text{VIF}_i = \frac{1}{1 - R_i^2} \in [1, \infty)\), where \(R_i^2\) is the coefficient of determination from regressing \(x_i\) on the remaining predictors. The minimum value of 1 is reached when \(x_i\) is completely uncorrelated with the remaining predictors.
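
A sketch on the built-in `mtcars` data, computing a VIF directly from the \(R^2\) of regressing one predictor on the others:

```r
# VIF for wt in a model that also uses hp and disp as predictors.
r2_wt <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
1 / (1 - r2_wt)  # VIF for wt
```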

37.11: Stepwise-Regression

In stepwise regression, the regression model begins with no predictors; the most significant predictor is then entered into the model, followed by the next most significant predictor. At each stage, every predictor already in the model is retested for significance, and predictors that are no longer significant are dropped. The procedure continues until no further predictors can be entered or dropped. The resulting model is usually a good regression model, although it is not guaranteed to be the global optimum.

40.1: Parametric-Methods

Parametric methods are the statistical methods that begin with an assumption about the probability distribution of the population which is often that the population has a normal distribution. A sampling distribution for the test statistic can then be derived and used to make an inference about one or more parameters of the population such as the population mean \({\mu}\) or the population standard deviation \({\sigma}\).

40.2: Distribution-free-Methods

Distribution-free methods are the Statistical methods that make no assumption about the probability distribution of the population.

40.3: Nonparametric-Methods

Nonparametric methods are the statistical methods that require no assumption about the form of the probability distribution of the population and are often referred to as distribution free methods. Several of the methods can be applied with categorical as well as quantitative data.

41.1: Data-Mining-331

Data mining is the process of discovering useful patterns and trends in large data sets.

41.2: Predictive-Analytics-331

Predictive analytics is the process of extracting information from large data sets in order to make predictions and estimates about future outcomes.

41.3: Description

Description of patterns and trends often suggests possible explanations for their existence within the data.

41.4: Estimation

In estimation, we approximate the value of a numeric target variable using a set of numeric and/or categorical predictor variables. Methods: Point Estimation, Confidence Interval Estimation, Simple Linear Regression, Correlation, Multiple Regression etc.

41.5: Prediction

Prediction is similar to classification and estimation, except that for prediction, the results lie in the future. Estimation methods are also used for Prediction. Additional Methods: k-nearest neighbor methods, decision trees, neural networks etc.

41.6: Classification

Classification is similar to estimation, however, instead of approximating the value of a numeric target variable, the target variable is categorical.

42.1: Variable-Flag-Dummy

A flag variable (or dummy variable, or indicator variable) is a categorical variable taking only two values, 0 and 1. Ex: Gender (Male, Female) can be recoded into dummy Gender (Male = 0, Female = 1).

44.1: Multicollinearity

Multicollinearity is a condition where some of the predictor variables are strongly correlated with each other.

44.2: Principle-of-Parsimony

Principle of parsimony is the problem-solving principle that “entities should not be multiplied beyond necessity.”

44.3: Overfitting

Overfitting is the production of an analysis that corresponds too closely to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.

44.4: Underfitting

Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data.

44.5: PCA

Principal components analysis (PCA) seeks to explain the correlation structure of a set of predictor variables \({m}\), using a smaller set of linear combinations of these variables, called components \({k}\). PCA acts solely on the predictor variables, and ignores the target variable.
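
A minimal sketch of PCA with base R’s `prcomp()` on the built-in `USArrests` data (predictors standardized first):

```r
# PCA: components are linear combinations of the standardized predictors.
pc <- prcomp(USArrests, scale. = TRUE)
summary(pc)  # proportion of variance explained by each component
pc$rotation  # component weights (loadings)
```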

44.6: Eigenvalues

Let \(\mathbf{B}\) be an \(m \times m\) matrix, and let \(\mathbf{I}\) be the \(m \times m\) identity matrix. Then the scalars \(\{\lambda_1, \lambda_2, \ldots, \lambda_m\}\) are said to be the eigenvalues of \(\mathbf{B}\) if they satisfy \(|\mathbf{B} - \lambda \mathbf{I}| = 0\), where \(|\mathbf{Q}|\) denotes the determinant of Q.

44.7: Eigenvector

Let \(\mathbf{B}\) be an \(m \times m\) matrix, and let \({\lambda}\) be an eigenvalue of \(\mathbf{B}\). Then nonzero \(m \times 1\) vector \(\overrightarrow{e}\) is said to be an eigenvector of \(\mathbf{B}\), if \(\mathbf{B} \overrightarrow{e} = \lambda\overrightarrow{e}\).
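
A quick sketch with base R’s `eigen()` on a small symmetric matrix:

```r
# Eigenvalues and eigenvectors of B; check B e = lambda e for the first pair.
B <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)
e <- eigen(B)
e$values              # 3 and 1
B %*% e$vectors[, 1]  # equals e$values[1] * e$vectors[, 1]
```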

44.8: Orthogonal

Two vectors are orthogonal if they are mathematically independent, have no correlation, and are at right angles to each other.

44.9: Communality

PCA does not extract all the variance from the variables, but only that proportion of the variance that is shared by several variables. Communality represents the proportion of variance of a particular variable that is shared with other variables. Communality values are calculated as the sum of squared component weights, for a given variable.

45.1: Unsupervised-Methods

In unsupervised methods, no target variable is identified as such. Instead, the data mining algorithm searches for patterns and structures among all the variables. The most common unsupervised data mining method is clustering. Ex: Voter Profile.

45.2: Supervised-Methods

Supervised methods are those in which there is a particular prespecified target variable and the algorithm is given many examples where the value of the target variable is provided. This allows the algorithm to learn which values of the target variable are associated with which values of the predictor variables.

45.3: A-Priori-Hypothesis

An a priori hypothesis is one that is generated prior to a research study taking place.

45.4: Cross-Validation

Cross-validation is a technique for ensuring that the results uncovered in an analysis are generalizable to an independent, unseen data set.

45.5: Bias-variance-Trade-off

Even though the high-complexity model has low bias (error rate), it has a high variance; and even though the low-complexity model has a high bias, it has a low variance. This is known as the bias-variance trade-off. It is another way of describing the overfitting-underfitting dilemma.

45.6: Resampling

Resampling refers to the process of sampling at random and with replacement from a data set. It is discouraged.

46.1: Clustering

Clustering refers to the grouping of records, observations, or cases into classes of similar objects. Clustering differs from classification in that there is no target variable for clustering.

46.2: Cluster

A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters.

46.3: Euclidean-Distance

Euclidean distance between records is given by equation, \(d_{\text{Euclidean}}(x,y) = \sqrt{\sum_i{\left(x_i - y_i\right)^2}}\), where \(x = \{x_1, x_2, \ldots, x_m\}\) and \(y = \{y_1, y_2, \ldots, y_m\}\) represent the \({m}\) attribute values of two records.
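
A sketch computing the Euclidean distance between two records by hand and with base R’s `dist()`:

```r
# Euclidean distance between records x and y.
x <- c(1, 2, 3)
y <- c(4, 6, 3)
sqrt(sum((x - y)^2))  # 5
dist(rbind(x, y))     # same value via dist()
```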

46.4: Hierarchical-Clustering

In hierarchical clustering, a treelike cluster structure (dendrogram) is created through recursive partitioning (divisive methods) or combining (agglomerative) of existing clusters.

46.5: Agglomerative-Clustering

Agglomerative clustering methods initialize each observation to be a tiny cluster of its own. Then, in succeeding steps, the two closest clusters are aggregated into a new combined cluster. In this way, the number of clusters in the data set is reduced by one at each step. Eventually, all records are combined into a single huge cluster. Most computer programs that apply hierarchical clustering use agglomerative methods.

46.6: Divisive-Clustering

Divisive clustering methods begin with all the records in one big cluster, with the most dissimilar records being split off recursively, into a separate cluster, until each record represents its own cluster.

46.7: Single-Linkage

Single linkage, the nearest-neighbor approach, is based on the minimum distance between any record in cluster A and any record in cluster B. Cluster similarity is based on the similarity of the most similar members from each cluster. It tends to form long, slender clusters, which may sometimes lead to heterogeneous records being clustered together.

46.8: Complete-Linkage

Complete linkage, the farthest-neighbor approach, is based on the maximum distance between any record in cluster A and any record in cluster B. Cluster similarity is based on the similarity of the most dissimilar members from each cluster. It tends to form more compact, spherelike clusters.

46.9: Average-Linkage

Average linkage is designed to reduce the dependence of the cluster-linkage criterion on extreme values, such as the most similar or dissimilar records. The criterion is the average distance of all the records in cluster A from all the records in cluster B. The resulting clusters tend to have approximately equal within-cluster variability. In general, average linkage leads to clusters more similar in shape to complete linkage than does single linkage.
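
A sketch on the built-in `USArrests` data: base R’s `hclust()` supports all three linkage criteria described above.

```r
# Agglomerative hierarchical clustering under three linkage criteria.
d <- dist(scale(USArrests))  # Euclidean distances on standardized data
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")
plot(hc_complete)            # dendrogram
cutree(hc_complete, k = 4)   # assign each record to one of 4 clusters
```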

47.1: Cluster-Separation

Cluster separation represents how distant the clusters are from each other.

47.2: Cluster-Cohesion

Cluster cohesion refers to how tightly related the records within the individual clusters are. SSE accounts only for cluster cohesion.

47.3: Silhouette

The silhouette is a characteristic of each data value. For each data value i, \(\text{Silhouette}_i = s_i = \frac{b_i - a_i}{\text{max}(b_i, a_i)} \to s_i \in [-1, 1]\), where \(a_i\) is the distance between the data value and its cluster center (Cohesion), and \(b_i\) is the distance between the data value and the next closest cluster center (Separation).
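
A sketch computing the average silhouette width for a k-means solution on the built-in `USArrests` data, using `silhouette()` from the cluster package (attached in this session):

```r
# Average silhouette width: closer to 1 means better-separated clusters.
library(cluster)
set.seed(9)
z   <- scale(USArrests)
km  <- kmeans(z, centers = 4, nstart = 25)
sil <- silhouette(km$cluster, dist(z))
summary(sil)$avg.width
```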

47.4: pseudo-F

The pseudo-F statistic measures the ratio of (i) the separation between the clusters, as measured by the mean square between the clusters (MSB), to (ii) the spread of the data within the clusters, as measured by the mean square error (MSE). i.e. \(F_{k-1, N-k} = \frac{\text{MSB}}{\text{MSE}} = \frac{\text{SSB}/(k-1)}{\text{SSE}/(N-k)}\)

48.1: Affinity-Analysis

Affinity analysis (or Association Rules, or Market Basket Analysis) is the study of attributes or characteristics that “go together.” It seeks to uncover rules for quantifying the relationship between two or more attributes. Association rules take the form “If antecedent, then consequent,” along with a measure of the support and confidence associated with the rule.

48.2: Support

The support (s) for a particular association rule \(A \Rightarrow B\) is the proportion of transactions in the set of transactions D that contain both antecedent A and consequent B. Support is Symmetric. \(\text{Support} = P(A \cap B) = \frac{\text{Number of transactions containing both A and B}}{\text{Total Number of Transactions}}\)

48.3: Confidence

The confidence (c) of the association rule \(A \Rightarrow B\) is a measure of the accuracy of the rule, as determined by the percentage of transactions in the set of transactions D containing the antecedent A that also contain the consequent B. Confidence is asymmetric: \(\text{Confidence} = P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{\text{Number of transactions containing both A and B}}{\text{Total Number of Transactions containing A}}\)
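
Both measures can be verified by direct counting on a toy transaction set (the vegetable-stand baskets below are illustrative):

```r
# Support and confidence for the rule {Potato} => {Tomato} by direct counting
baskets <- list(c("Potato", "Tomato"),
                c("Potato", "Tomato", "Onion"),
                c("Potato", "Onion"),
                c("Tomato"))
hasA <- sapply(baskets, function(t) "Potato" %in% t)   # antecedent present
hasB <- sapply(baskets, function(t) "Tomato" %in% t)   # consequent present
support    <- mean(hasA & hasB)             # P(A and B): 2 of 4 transactions
confidence <- sum(hasA & hasB) / sum(hasA)  # P(B | A): 2 of the 3 with Potato
c(support = support, confidence = confidence)
```

Reversing the rule to {Tomato} => {Potato} leaves the support unchanged but changes the confidence, which is what symmetric vs. asymmetric means here.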

48.4: Itemset

An itemset is a set of items contained in I, and a k-itemset is an itemset containing k items. For example, {Potato, Tomato} is a 2-itemset, and {Potato, Tomato, Onion} is a 3-itemset, each from the vegetable stand set I.

48.5: Itemset-Frequency

The itemset frequency is simply the number of transactions that contain the particular itemset.

48.6: Frequent-Itemset

A frequent itemset is an itemset that occurs at least a certain minimum number of times, having itemset frequency \(\geq \phi\). We denote the set of frequent k-itemsets as \(F_k\).

48.7: A-Priori-Property

The a priori property: if an itemset Z is not frequent, then for any item A, \(Z \cup \{A\}\) will not be frequent either. In fact, no superset of Z (no itemset containing Z) can be frequent.
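
The property is easy to confirm by counting: a superset's frequency can never exceed that of any of its subsets. A tiny sketch with the same illustrative baskets:

```r
# a priori property: itemset frequency is monotone non-increasing under supersets
baskets <- list(c("Potato", "Tomato"),
                c("Potato", "Tomato", "Onion"),
                c("Potato", "Onion"),
                c("Tomato"))
freq <- function(itemset) sum(sapply(baskets, function(t) all(itemset %in% t)))
freq(c("Potato", "Tomato"))             # frequency of the 2-itemset
freq(c("Potato", "Tomato", "Onion"))    # its superset cannot be more frequent
```

This monotonicity is what lets the Apriori algorithm prune the search: once an itemset fails the frequency threshold, every superset can be skipped.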

48.8: Lift

Lift is a measure that can quantify the usefulness of an association rule. Lift is Symmetric. \(\text{Lift} = \frac{\text{Rule Confidence}}{\text{Prior proportion of Consequent}}\)
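
Support, confidence, and lift are all reported by apriori() from the arules package (attached above); a sketch on the same toy baskets:

```r
library(arules)   # apriori(), transactions coercion, inspect()

baskets <- list(c("Potato", "Tomato"),
                c("Potato", "Tomato", "Onion"),
                c("Potato", "Onion"),
                c("Tomato"))
trans <- as(baskets, "transactions")
rules <- apriori(trans,
                 parameter = list(supp = 0.25, conf = 0.5, minlen = 2),
                 control   = list(verbose = FALSE))
inspect(sort(rules, by = "lift"))   # each rule with its support, confidence, lift
```

Lift above 1 means the antecedent raises the chance of the consequent relative to its prior proportion; lift below 1 means it lowers it.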

ERRORS

1.1: cannot-open-connection

Error in file(file, ifelse(append, "a", "w")) : cannot open the connection

1.2: need-finite-xlim

Error in plot.window(...) : need finite 'xlim' values

1.3: par-old-par

Error in par(old.par) : invalid value specified for graphical parameter "pin"

2.1: plot-finite-xlim

Error in plot.window(...) : need finite 'xlim' values

2.2: Function-Not-Found

Error in arrange(bb, day) : could not find function "arrange"
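
This one usually means the package that owns the function is not attached; arrange() comes from dplyr (the bb and day names below are illustrative stand-ins for the original call):

```r
# "could not find function" => attach the owning package (or use pkg::fun)
library(dplyr)
bb <- data.frame(day = c(3, 1, 2), value = c(30, 10, 20))  # illustrative stand-in
arrange(bb, day)            # resolves once dplyr is attached
dplyr::arrange(bb, day)     # also works without attaching the package
```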

3.1: Object-Not-Found-01

Error in match.arg(method) : object 'day' not found

3.2: Comparison-possible

Error in day == 1 : comparison (1) is possible only for atomic and list types

3.3: UseMethod-No-applicable-method

Error in UseMethod("select") : no applicable method for 'select' applied to an object of class "function"

3.4: Object-Not-Found-02

Error: Problem with mutate() column ... column object 'arr_delay' not found

16.1: plot-new

Error in ... : plot.new has not been called yet

16.2: non-numeric-argument

Error in plot(...) : non-numeric argument to binary operator

16.3: Not-Exported-Object

Error: 'plot' is not an exported object from 'namespace:arulesViz'

17.1: lm-non-numeric-y

Warning messages: In model.response(mf, "numeric") : using type = "numeric" with a factor response will be ignored. In Ops.factor(y, ...) : '-' not meaningful for factors

24.1: gg-stat-count-geom-bar

Error: stat_count() can only have an x or y aesthetic.

28.1: ggplot-list

Error in is.finite(x) : default method not implemented for type 'list'

28.2: ggplot-data

Error: Must subset the data pronoun with a string.

32.1: shapiro-limit

Error in shapiro.test(...) : sample size must be between 3 and 5000
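
shapiro.test() hard-codes a sample-size range of 3 to 5000; for larger samples, test a random subsample, or use a normality test without that limit such as nortest::ad.test (nortest is attached above):

```r
# shapiro.test() refuses n > 5000, so subsample large vectors first
set.seed(1)
x <- rnorm(10000)                     # too large for shapiro.test() directly
res <- shapiro.test(sample(x, 5000))  # within the allowed 3..5000 range
res$p.value
```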

32.2: t-test-grouping

Error in t.test.formula() : grouping factor must have exactly 2 levels
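
t.test() with a formula insists on exactly two groups; subset the data (and droplevels() any factor) down to two groups before calling it. The data frame below is an illustrative stand-in:

```r
# grouping factor must have exactly 2 levels: keep two groups, drop the rest
set.seed(1)
df  <- data.frame(y = rnorm(30), g = rep(c("a", "b", "c"), each = 10))
sub <- droplevels(subset(df, g %in% c("a", "b")))  # exactly two groups remain
t.test(y ~ g, data = sub)                          # now runs
```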

42.1: Insufficient-Data

Error: Insufficient data values to produce ... bins.

43.1: stat-count-xy

Error: stat_count() can only have an x or y aesthetic.

43.2: stat-count-y

Error: stat_count() must not be used with a y aesthetic.

44.1: CorMat

Error in if (prod(R2) < 0) : missing value where TRUE/FALSE needed